Loading targets from other species in a GFF file into Chado

Stephen Ficklin-2
Hi all,

My apologies if you got this twice. I meant to send it to the
gmod-schema list, but accidentally sent it first to the gmod-tripal list.

We are trying to load a GFF file into Chado using the Tripal loader.
The file is full of alignments that use the 'match' and 'match_part'
feature types.  The landmark is a chromosomal sequence from one species,
but the aligned sequences belong to another, closely related species.
We have both sets of sequences stored in Chado and want to link each
alignment not only to the new 'match' and 'match_part' features but
also to the target feature. Below are a few example lines from the GFF
file.

LG2  GDR  match       22139177  22140819  100.00  -  .  ID=JX013940.1-SUT4_mid1;Target=JX013940.1-SUT4 1479 1
LG2  GDR  match_part  22139177  22139353  .       -  .  Parent=JX013940.1-SUT4_mid1;Target=JX013940.1-SUT4 1479 222
LG2  GDR  match_part  22139435  22139478  .       -  .  Parent=JX013940.1-SUT4_mid1;Target=JX013940.1-SUT4 221 178
LG2  GDR  match_part  22139562  22140819  .       -  .  Parent=JX013940.1-SUT4_mid1;Target=JX013940.1-SUT4 177 1

In these lines, LG2 belongs to the species 'Fragaria vesca' and the NCBI
sequence 'JX013940.1-SUT4' belongs to the species 'Fragaria x
ananassa'.  When loading the GFF file, the new match and match_part
features get associated with the Fragaria vesca organism, but the loader
complains that it can't find the target feature 'JX013940.1-SUT4'
because that feature belongs to another organism.  The loader can't
simply query the Chado database for the Target by name alone, because
we can't guarantee the name is unique across all organisms.
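
To make the ambiguity concrete, here is a minimal Python sketch that uses an in-memory SQLite database as a heavily simplified stand-in for Chado's organism and feature tables. The real tables have many more columns and reference a cvterm for the type; the 'mRNA' type, the feature IDs, and the third organism carrying the same name are assumptions invented for illustration only.

```python
import sqlite3

# Simplified stand-in for Chado: only the columns needed for the lookup.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    genus TEXT NOT NULL,
    species TEXT NOT NULL
);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id),
    uniquename TEXT NOT NULL,
    type TEXT NOT NULL,
    UNIQUE (organism_id, uniquename, type)
);
INSERT INTO organism VALUES (1, 'Fragaria', 'vesca');
INSERT INTO organism VALUES (2, 'Fragaria', 'x ananassa');
-- Hypothetical third organism, added only to show a name collision.
INSERT INTO organism VALUES (3, 'Fragaria', 'chiloensis');
INSERT INTO feature VALUES (10, 1, 'LG2', 'chromosome');
INSERT INTO feature VALUES (11, 2, 'JX013940.1-SUT4', 'mRNA');
INSERT INTO feature VALUES (13, 3, 'JX013940.1-SUT4', 'mRNA');
""")


def find_by_name(name):
    """Look a feature up by name only, ignoring the organism."""
    rows = conn.execute(
        "SELECT feature_id FROM feature WHERE uniquename = ? "
        "ORDER BY feature_id",
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


def find_in_organism(name, genus, species):
    """Look a feature up by name within a single organism."""
    rows = conn.execute(
        "SELECT f.feature_id FROM feature f "
        "JOIN organism o ON o.organism_id = f.organism_id "
        "WHERE f.uniquename = ? AND o.genus = ? AND o.species = ? "
        "ORDER BY f.feature_id",
        (name, genus, species),
    ).fetchall()
    return [row[0] for row in rows]


# Scoped to the landmark's organism: nothing is found.
print(find_in_organism("JX013940.1-SUT4", "Fragaria", "vesca"))       # []
# Name only: more than one candidate, so the match is ambiguous.
print(find_by_name("JX013940.1-SUT4"))                                # [11, 13]
# Scoped to the target's own organism: exactly one feature.
print(find_in_organism("JX013940.1-SUT4", "Fragaria", "x ananassa"))  # [11]
```

The point of the sketch is only that the lookup needs the target's organism as an extra key; how the loader obtains that organism is the open question.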

One idea to get around this is to allow for a new attribute
'Target_organism=[Genus],[species]' where the genus and species can be
specified.  Does this sound reasonable? Or is there a better suggestion?
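
As a rough illustration of what reading such an attribute could look like, here is a small Python sketch that splits column 9 of a GFF3 line and pulls out the Target and the proposed 'Target_organism' value. The function names are made up for this example, the attribute itself is only the proposal above, and the sketch skips the URL-unescaping that a full GFF3 parser would do.

```python
def parse_attributes(column9):
    """Split a GFF3 attributes column into a tag -> value dictionary."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        attrs[tag.strip()] = value.strip()
    return attrs


def parse_target(value):
    """Parse a GFF3 Target value: 'target_id start end [strand]'."""
    parts = value.split()
    return {
        "id": parts[0],
        "start": int(parts[1]),
        "end": int(parts[2]),
        "strand": parts[3] if len(parts) > 3 else None,
    }


def parse_organism(value):
    """Parse the proposed '[Genus],[species]' value into a (genus, species) pair."""
    genus, _, species = value.partition(",")
    return genus.strip(), species.strip()


# First example line from above, with the proposed attribute appended.
column9 = ("ID=JX013940.1-SUT4_mid1;Target=JX013940.1-SUT4 1479 1;"
           "Target_organism=Fragaria,x ananassa")
attrs = parse_attributes(column9)
target = parse_target(attrs["Target"])
genus, species = parse_organism(attrs["Target_organism"])
print(target["id"], genus, species)
```

One caveat with this format: GFF3 treats a comma inside an attribute value as a separator between multiple values, so a stricter parser might hand back the genus and species as two separate values rather than one string.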

Thanks,
Stephen

------------------------------------------------------------------------------
Free Next-Gen Firewall Hardware Offer
Buy your Sophos next-gen firewall before the end March 2013
and get the hardware for free! Learn more.
http://p.sf.net/sfu/sophos-d2d-feb
_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema

Re: Loading targets from other species in a GFF file into Chado

Siddhartha Basu
On Tue, 12 Feb 2013, Stephen Ficklin wrote:

> One idea to get around this is to allow for a new attribute
> 'Target_organism=[Genus],[species]' where the genus and species can be
> specified.  Does this sound reasonable? Or is there a better suggestion?
You might also consider accepting a target_organism parameter in the
loader.

thanks,
-siddhartha

Re: Loading targets from other species in a GFF file into Chado

Scott Cain
Hi Stephen and Siddhartha,

The Perl bulk loader supports specifying the organism of a line of GFF
with the attribute "organism" (though looking now, this feature isn't
documented in the perldoc!).  I think this could be extended to
target_organism for match/match_part features.  I'd rather not use the
capital T, since that would violate the GFF3 spec.  Since you can also
specify the organism at the command line level, it makes sense to
support specifying the target organism in both the Perl and Tripal
loaders, so Siddhartha's suggestion is good as well (then you
don't have to put an attribute on every line of the GFF file if they
are all going to be the same anyway).
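
For anyone wondering about the capital-T point: as I read the GFF3 specification, attribute tags that begin with an uppercase letter are reserved, while tags that begin with a lowercase letter are free for applications to use. A small Python sketch of that rule follows; the function name and the three labels it returns are invented for this example.

```python
# Tags with predefined meanings in the GFF3 specification.
GFF3_PREDEFINED_TAGS = frozenset({
    "ID", "Name", "Alias", "Parent", "Target", "Gap", "Derives_from",
    "Note", "Dbxref", "Ontology_term", "Is_circular",
})


def classify_tag(tag):
    """Classify an attribute tag as 'predefined', 'reserved', or 'custom'."""
    if tag in GFF3_PREDEFINED_TAGS:
        return "predefined"
    if tag[:1].isupper():
        # Uppercase first letter, but not one of the predefined tags.
        return "reserved"
    return "custom"


for tag in ("Target", "Target_organism", "target_organism", "organism"):
    print(tag, "->", classify_tag(tag))
```

Under that reading, 'Target_organism' lands in the reserved space, while 'target_organism' and 'organism' are ordinary application-defined tags.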

Scott


On Wed, Feb 13, 2013 at 11:25 AM, Siddhartha Basu <[hidden email]> wrote:

> You might also consider accepting a target_organism parameter in the
> loader.

--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research


Re: Loading targets from other species in a GFF file into Chado

Stephen Ficklin-2
Hi Scott and Siddhartha,

Thanks for your responses.

Oh, I wasn't aware of the 'organism' attribute.  That's good to know. So
I think what I'll do is the following:

1.  Add a parameter to the Tripal loader to let folks optionally select
the target organism and target feature type (since the unique constraint
for a feature includes both).
2.  For cases where the GFF file is full of Targets to various species
the loader can support the 'target_organism' and 'target_type'
attributes when the 'Target' attribute is present.
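
The two steps above could combine roughly as in the following Python sketch, where a per-line attribute overrides the loader-level setting. This is only an illustration of the intended precedence, not the Tripal loader's code; the function name, its parameters, and the treatment of the values as plain strings are all assumptions.

```python
def resolve_target_scope(attrs, default_organism=None, default_type=None):
    """Decide which organism and feature type to use when looking up a Target.

    attrs is the parsed attributes column for one GFF line. Per-line
    'target_organism' and 'target_type' values win over the loader-level
    defaults. Returns None when the line has no Target attribute.
    """
    if "Target" not in attrs:
        return None
    organism = attrs.get("target_organism", default_organism)
    feature_type = attrs.get("target_type", default_type)
    if organism is None or feature_type is None:
        raise ValueError(
            "Target present but no target organism/type given on the line "
            "or as a loader parameter"
        )
    return organism, feature_type


# Loader-level defaults apply when the line carries no override.
line = {"ID": "JX013940.1-SUT4_mid1", "Target": "JX013940.1-SUT4 1479 1"}
print(resolve_target_scope(line, "Fragaria x ananassa", "mRNA"))

# A per-line attribute overrides the default organism.
line["target_organism"] = "Fragaria vesca"
print(resolve_target_scope(line, "Fragaria x ananassa", "mRNA"))
```

Raising an error when neither source provides a value is one possible choice; falling back to the landmark's organism would be another.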

Stephen

On 2/13/2013 11:47 AM, Scott Cain wrote:

> Hi Stephen and Siddhartha,
>
> The Perl bulk loader supports specifying the organism of a line of GFF
> with the attribute "organism" (though looking now, this feature isn't
> documented in the perldoc!).  I think this could be extended to
> target_organism for match/match_part features.  I'd rather not use the
> capital T, since that would violate the GFF3 spec.  Since you can also
> specify the organism at the command line level, it makes sense to
> support specifying the target organism in both the Perl and Tripal
> loaders, so Siddhartha's suggestion is good as well (then you
> don't have to put an attribute on every line of the GFF file if they
> are all going to be the same anyway).
>
> Scott

Re: Loading targets from other species in a GFF file into Chado

Siddhartha Basu
On Wed, 13 Feb 2013, Stephen Ficklin wrote:

> 1.  Add a parameter to the Tripal loader to let folks optionally select
> the target organism and target feature type (since the unique constraint
> for a feature includes both).
> 2.  For cases where the GFF file is full of Targets to various species
> the loader can support the 'target_organism' and 'target_type'
> attributes when the 'Target' attribute is present.
Great!!! makes sense to me.

thanks,
-siddhartha
