Loading GFF3 with CDS with identical ID

mdhar
Hello,

We're attempting to load a GFF3 file into Tripal, and have noticed that it will not accept multiple CDS features with the same ID. This post (http://gmod.827538.n3.nabble.com/Problem-loading-GFF-from-maker-into-Tripal-2-0-td4050189.html) suggests that this issue was addressed in a development patch in 2015. We are using the stable Tripal version 2.0, Chado version 1.23. Has the update for multiple CDS that have the same ID not been made to Tripal version 2.0? Is there any other way, or patch, to address this?

Thanks very much.

Sincerely,
Michael 

------------------------------------------------------------------------------

_______________________________________________
Gmod-tripal mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-tripal
Re: Loading GFF3 with CDS with identical ID

Stephen Ficklin-2

Hi Michael,

Thanks for sending along the issue. Are you getting that exact same error? Namely: "chado_insert_record; Cannot insert duplicate record into tripal_gffcds_temp table"?

I think there are actually two issues in that posting. One was the issue I fixed with the temp table, and the second was with the GFF3 file itself. We never resolved the second issue because there was no follow-up post and I failed to recognize it at the time. So, I think the issue with the temp table is fixed, but the second problem is that it's not actually possible for two features in a GFF3 file to have the same ID. As per the GFF3 standard on the GMOD wiki (http://gmod.org/wiki/GFF3) under "Column 9 tags":

"ID: Indicates the unique identifier of the feature. IDs must be unique within the scope of the GFF file." 

Additionally, there is an integrity constraint on the Chado feature table that all IDs must be unique within the organism and feature type (e.g. CDS). So it's not possible to load two CDSs for the same organism that have the exact same ID; the integrity constraints on the table will cause the import to fail. Even if we adjusted the Tripal GFF loader to ignore CDSs with the same ID, the Chado feature table constraints would block the insert.
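
To see which features would run into this, a short check for repeated IDs can help. The following is only a sketch, not part of Tripal or Chado: the function names parse_attributes and duplicate_ids are made up for illustration, and it assumes a well-formed, tab-separated GFF3 file with nine columns per feature line.

```python
from collections import Counter


def parse_attributes(column9):
    """Split a GFF3 attribute column (column 9) into a dict of tag -> value."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            tag, value = pair.split("=", 1)
            attrs[tag.strip()] = value.strip()
    return attrs


def duplicate_ids(lines):
    """Return {(feature type, ID): count} for every ID found on more than one line."""
    counts = Counter()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # sequence data follows, no more feature lines
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        feature_id = parse_attributes(cols[8]).get("ID")
        if feature_id is not None:
            counts[(cols[2], feature_id)] += 1
    return {key: n for key, n in counts.items() if n > 1}
```

Calling duplicate_ids(open("annotations.gff3")) (the file name is a placeholder) lists every type and ID pair that occurs on more than one line.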

You might want to check with the creators of the gene prediction software (is it MAKER?) to make sure this isn't a bug in the GFF3 export. But a relatively simple fix would be to append the fmin (start) position in the GFF file to the end of each CDS ID, which will give you a unique ID. A little Perl or Python script should be able to do this for you.
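
A minimal Python sketch of that fix might look like the following. It is an illustration under assumptions, not a tested Tripal tool: the function name uniquify_cds_ids is hypothetical, the start coordinate is taken from column 4 exactly as written in the file, and it assumes no other feature refers to a CDS ID through a Parent attribute. Two CDS lines that share both ID and start would still collide.

```python
import re


def uniquify_cds_ids(lines):
    """Yield GFF3 lines, appending the start coordinate to the ID of each CDS line."""
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Pass comments, directives and non-CDS features through untouched.
        if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            yield line
            continue
        start = cols[3]  # column 4 of the GFF3 line
        # Rewrite only the ID tag; Parent and other attributes are left as they are.
        cols[8] = re.sub(
            r"(^|;)ID=([^;]+)",
            lambda m: f"{m.group(1)}ID={m.group(2)}-{start}",
            cols[8],
            count=1,
        )
        yield "\t".join(cols) + "\n"
```

The output can be written back with something like out.writelines(uniquify_cds_ids(src)), where src and out are open file handles for the original and the edited file.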

Stephen



Re: Loading GFF3 with CDS with identical ID

Nathan S. Watson-Haigh

Hi Stephen,


I read your response regarding unique GFF3 IDs within the scope of a particular GFF3 file. What you stated isn't strictly correct: each part of a discontinuous feature occupies a different line in the GFF3 file, but the lines DO have the same ID:

http://gmod.org/wiki/GFF3#Discontinuous_Features


However, it does also state that this GFF3 feature is not supported by the GMOD Chado bulk GFF3 loader. So, I think multiple identical IDs are valid GFF3, but Chado doesn't support them. That said, Parent-child relationships between features that do have unique IDs would probably be a better way to group them.
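
To tell a discontinuous feature apart from an accidental ID clash, the lines can be grouped by ID and compared. The sketch below is a rough heuristic with made-up function names (group_by_id, looks_discontinuous): it assumes that the parts of one discontinuous feature share sequence, feature type and strand, which is a simplification and not a full reading of the GFF3 specification.

```python
from collections import defaultdict


def group_by_id(lines):
    """Map each ID to the (seqid, type, start, end, strand) rows that carry it."""
    groups = defaultdict(list)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue  # skip comments, directives and anything that is not a feature line
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        if "ID" in attrs:
            groups[attrs["ID"]].append(
                (cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6])
            )
    return groups


def looks_discontinuous(rows):
    """Heuristic: more than one row, all on the same sequence, type and strand."""
    return len(rows) > 1 and len({(r[0], r[1], r[4]) for r in rows}) == 1
```

IDs whose rows fail looks_discontinuous but still occur more than once are the ones most likely to be genuine errors in the file.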


Sorry for chiming in with something not too helpful for the OP!


Cheers,

Nathan



Re: Loading GFF3 with CDS with identical ID

Stephen Ficklin-2
Hi Nathan

Thank you for chiming in! I do appreciate the correction. I hate to give bad info. You are right, it is proper format for a GFF3 file to have discontinuous features. I had forgotten about them. The nested features are definitely easier to deal with. I can see a few challenges trying to get discontinuous features into Chado. I've added an issue to our issue tracker to remind us to think on it.

https://www.drupal.org/node/2796857

So thanks much. 

Michael, 

In the meantime, I think editing the file in the way I suggested will allow you to load it, provided your CDSs have a Parent attribute (which is not required for discontinuous features).

Thanks 
Stephen






Re: Loading GFF3 with CDS with identical ID

Michael Dondrup-3
Hi,

I think one can use the standard Chado importer in the meantime: gmod_bulk_load_gff3.pl. At least this worked fine for us with the salmon louse genome, which comes from the Ensembl annotation file. It generates polypeptide sequences named polypeptide-auto…, which seemed to have the correct AA sequence. The only problem is that it is hard to rename the polypeptide-auto entries and the importer doesn't use the CDS ID, but it did not complain either.

One of our gene models looks like this:

LSalAtl2s1  ensembl protein_coding_gene 169164  204974  .   +   .   ID=EMLSAG00000003688;biotype=protein_coding;description=maker-LSalAtl2s1-augustus-gene-2.21;logic_name=ensemblgenomes
LSalAtl2s1  ensembl transcript  169164  204974  .   +   .   ID=EMLSAT00000003688;Parent=EMLSAG00000003688;biotype=protein_coding;logic_name=ensemblgenomes
LSalAtl2s1  .   exon    169164  169208  .   +   .   ID=EMLSAE00000016048;Parent=EMLSAT00000003688;ensembl_end_phase=-1;ensembl_phase=-1;rank=1
# LSalAtl2s1    .   five_prime_UTR  169164  188833  .   +   .   Parent=EMLSAT00000003688;
LSalAtl2s1  .   five_prime_UTR  169164  188833  .   +   .   Parent=EMLSAT00000003688;ID=EMLSAT00000003688-five_prime_UTR
# LSalAtl2s1    .   CDS 188834  189007  .   +   0   Parent=EMLSAT00000003688;protein_id=EMLSAP00000003688;rank=1
LSalAtl2s1  .   CDS 188834  189007  .   +   0   Parent=EMLSAT00000003688;protein_id=EMLSAP00000003688;rank=1;ID=EMLSAP00000003688-CDS
# LSalAtl2s1    .   CDS 189059  189382  .   +   0   Parent=EMLSAT00000003688;protein_id=EMLSAP00000003688;rank=2
LSalAtl2s1  .   CDS 189059  189382  .   +   0   Parent=EMLSAT00000003688;protein_id=EMLSAP00000003688;rank=2;ID=EMLSAP00000003688-CDS
# LSalAtl2s1    .   CDS 198618  198759  .   +   0   Parent=EMLSAT00000003688;protein_id=EMLSAP00000003688;rank=3
LSalAtl2s1  .   CDS 198618  198759  .   +   0   Parent=EMLSAT00000003688;protein_id=EMLSAP00000003688;rank=3;ID=EMLSAP00000003688-CDS
# LSalAtl2s1    .   CDS 204808  204974  .   +   0   Parent=EMLSAT00000003688;protein_id=EMLSAP00000003688;rank=4
LSalAtl2s1  .   CDS 204808  204974  .   +   0   Parent=EMLSAT00000003688;protein_id=EMLSAP00000003688;rank=4;ID=EMLSAP00000003688-CDS
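
The ID pattern visible in the lines above (protein_id plus "-CDS" for CDS lines, Parent plus the feature type for other lines without an ID) could be applied with a script along these lines. This is a guess at the pattern from the sample only, not the script actually used: the function name add_missing_ids is hypothetical, it assumes tab-separated columns, and it does not handle features with several comma-separated parents.

```python
def add_missing_ids(lines):
    """Yield GFF3 lines, adding an ID attribute to feature lines that lack one."""
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            yield line  # comments, directives and non-feature lines pass through
            continue
        pairs = [p for p in cols[8].split(";") if p]
        attrs = dict(p.split("=", 1) for p in pairs if "=" in p)
        # CDS lines take their base name from protein_id, everything else from Parent.
        base = attrs.get("protein_id") if cols[2] == "CDS" else None
        base = base or attrs.get("Parent")
        if "ID" in attrs or not base:
            yield line  # already has an ID, or nothing to build one from
            continue
        pairs.append(f"ID={base}-{cols[2]}")
        cols[8] = ";".join(pairs)
        yield "\t".join(cols) + "\n"
```

Note that all CDS lines of one protein end up with the same ID, as in the sample, so the result relies on the loader accepting CDS lines that share an ID.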

Michael D.

Michael Dondrup
Researcher
Sea Lice Research Centre
Department of Informatics
University of Bergen
Thormøhlensgate 55, N-5008 Bergen,
Norway
