Transcript & protein fasta sequence id/name collisions

Transcript & protein fasta sequence id/name collisions

Stein, Joshua
Dear Carson and maker-devel group,

In our recent MAKER run, some of the transcript and protein IDs in the FASTA files (e.g. *.all.maker.transcripts.fasta) correspond to the "Name=" field of the GFF. The problem is that these names are not unique: for example, the transcript ID 'mRNA_4' occurs 45 times, which makes it difficult to determine the corresponding gene IDs. My colleague, Kapeel, who is copied, believes this happens when fgenesh results are passed to MAKER as a GFF using the "pred_gff=" parameter.

How do we prevent this from happening in the future (e.g. tell MAKER to use the "ID=" field for transcript/protein FASTA IDs)?
Do you have any tips on how to triage the current situation (i.e. figure out which FASTA record corresponds to which gene)? I suppose it is possible to match them up based on the quality metrics in the definition line, assuming these are unique.
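
As a first triage step, it may help to measure how widespread the collisions are. The following is a minimal Python sketch, not part of MAKER, that counts repeated record IDs in a FASTA file. It assumes the ID is the first whitespace-delimited token after ">"; the function name and the example records are made up for illustration.

```python
from collections import Counter

def duplicate_fasta_ids(lines):
    """Return {id: count} for FASTA record IDs that occur more than once.

    Assumption: the record ID is the first whitespace-delimited token
    following '>' on a definition line.
    """
    counts = Counter(
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    )
    return {seq_id: n for seq_id, n in counts.items() if n > 1}

# Made-up records, for illustration only
example = [
    ">mRNA_4 first record",
    "MKT",
    ">mRNA_4 second record",
    "MLS",
    ">mRNA_7 third record",
    "MAA",
]
print(duplicate_fasta_ids(example))  # {'mRNA_4': 2}
```

For a real file, an open file handle can be passed in place of the list, since the function only iterates over lines.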

Thanks,
Josh


Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
[hidden email]
http://ware.cshl.org/



_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Re: Transcript & protein fasta sequence id/name collisions

Carson Holt-2
The ID= tag in GFF3 is not meant to be an identifier (despite being called ID). Instead, it is a tag used to determine inheritance when reassembling a feature. All GMOD tools will treat it as such, while Name will be treated as the user-facing identifier. To further complicate things, 'Name' is not required to be unique (some groups like to give multi-copy genes the same name and then list multiple locations in the GFF3; you will see this all the time in NCBI-derived GFF3, for example). I would not be surprised if fgenesh does not produce "best-practice" GFF3. You may need to alter it slightly before using it.

On the MAKER end, at the very least it would be nice for MAKER to complain that the names in the input file are not unique (even though non-unique names are allowed in GFF3). If you give GFF3 to pred_gff, then MAKER should build its own unique names for things, but for model_gff it will keep the names you give it.
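
Until MAKER warns about this itself, a check along these lines could be run on the input file beforehand. This is only a sketch in Python, not MAKER code: the function names are hypothetical, it looks only at mRNA features by default, and the example lines (including the "fgenesh" source column) are invented.

```python
from collections import Counter

def parse_attributes(column9):
    """Split a GFF3 attributes column ('ID=x;Name=y') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def duplicate_names(gff_lines, feature_types=("mRNA",)):
    """Return {Name: count} for Name values used by more than one feature."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in feature_types:
            continue
        name = parse_attributes(cols[8]).get("Name")
        if name is not None:
            counts[name] += 1
    return {name: n for name, n in counts.items() if n > 1}

# Invented GFF3 lines, for illustration only
example = [
    "##gff-version 3",
    "contig1\tfgenesh\tmRNA\t100\t900\t.\t+\t.\tID=m1;Parent=g1;Name=mRNA_4",
    "contig2\tfgenesh\tmRNA\t50\t700\t.\t-\t.\tID=m2;Parent=g2;Name=mRNA_4",
    "contig2\tfgenesh\tmRNA\t900\t1500\t.\t+\t.\tID=m3;Parent=g3;Name=mRNA_5",
]
print(duplicate_names(example))  # {'mRNA_4': 2}
```

An empty result means every checked feature has a distinct Name. A non-empty result is not an error in GFF3 terms, but it signals that downstream FASTA IDs could collide if the names are kept.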

—Carson


> On Jun 12, 2018, at 12:08 PM, Stein, Joshua <[hidden email]> wrote:
> [...]



Re: Transcript & protein fasta sequence id/name collisions

Stein, Joshua
Hi Carson,
Thanks for identifying the problem.  I see that the input fgenesh GFF3 did not use unique identifiers, so we will attack the problem there.
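
One way the fix at the GFF3 level might look is sketched below in Python. It is an illustration under assumptions, not the procedure we actually used: the function name is hypothetical, the renaming scheme (prefix the sequence ID, then add a numeric suffix to any remaining repeats) is just one possible choice, and the example lines are invented. ID= and Parent= are left untouched so feature inheritance is preserved.

```python
from collections import Counter

def uniquify_names(gff_lines, feature_types=("gene", "mRNA")):
    """Return GFF3 lines with Name= made unique per feature type.

    Assumed scheme: Name becomes '<seqid>_<Name>'; if that is still
    repeated, later occurrences get '.2', '.3', ... appended.
    """
    seen = Counter()
    out = []
    for line in gff_lines:
        stripped = line.rstrip("\n")
        cols = stripped.split("\t")
        if stripped.startswith("#") or len(cols) != 9 or cols[2] not in feature_types:
            out.append(stripped)  # pass through comments and other lines
            continue
        fields = []
        for field in cols[8].split(";"):
            key, sep, value = field.partition("=")
            if key == "Name" and sep:
                base = f"{cols[0]}_{value}"
                seen[(cols[2], base)] += 1
                n = seen[(cols[2], base)]
                value = base if n == 1 else f"{base}.{n}"
            fields.append(f"{key}{sep}{value}")
        cols[8] = ";".join(fields)
        out.append("\t".join(cols))
    return out

# Invented GFF3 lines, for illustration only
lines = [
    "contig1\tfgenesh\tmRNA\t100\t900\t.\t+\t.\tID=m1;Name=mRNA_4",
    "contig2\tfgenesh\tmRNA\t50\t700\t.\t-\t.\tID=m2;Name=mRNA_4",
    "contig2\tfgenesh\tmRNA\t1500\t2700\t.\t-\t.\tID=m3;Name=mRNA_4",
]
for fixed in uniquify_names(lines):
    print(fixed.split("\t")[8])
# ID=m1;Name=contig1_mRNA_4
# ID=m2;Name=contig2_mRNA_4
# ID=m3;Name=contig2_mRNA_4.2
```

The suffix scheme could in principle collide with a name that already ends in ".2", so the output is worth re-checking for uniqueness before handing it to MAKER.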

Best,
Josh

> On Jun 12, 2018, at 4:19 PM, Carson Holt <[hidden email]> wrote:
> [...]



