Mailing list for GFF3 specification discussion?

Peter Cock
Dear Lincoln,

Is there a preferred mailing list for discussing the GFF3 specification?
I'm guessing gmod-gbrowse (CC'd) since there has been previous
discussion in the context of correctly formatting annotation for GBrowse.

My interest stems from wanting to support GFF3 properly in Biopython,
EMBOSS, etc - and discovering even major biological data providers
like the NCBI don't seem to be producing valid GFF3 files:

http://blastedbio.blogspot.com/2011/08/why-are-ncbi-gff3-files-still-broken.html

Thanks,

Peter

------------------------------------------------------------------------------
Get a FREE DOWNLOAD! and learn more about uberSVN rich system,
user administration capabilities and model configuration. Take
the hassle out of deploying and managing Subversion and the
tools developers use with it. http://p.sf.net/sfu/wandisco-d2d-2
_______________________________________________
Gmod-gbrowse mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-gbrowse

Re: Mailing list for GFF3 specification discussion?

Fields, Christopher J
The issue with NCBI's GFF3 output has been known for a while now.  They have been repeatedly notified about it and indicated that it was to be addressed at some point, but I haven't heard any updates on it.

chris

On Aug 17, 2011, at 6:13 AM, Peter Cock wrote:

> Dear Lincoln,
>
> Is there a preferred mailing list for discussing the GFF3 specification?
> I'm guessing gmod-gbrowse (CC'd) since there has been previous
> discussion in the context of correctly formatting annotation for GBrowse.
>
> My interest stems from wanting to support GFF3 properly in Biopython,
> EMBOSS, etc - and discovering even major biological data providers
> like the NCBI don't seem to be producing valid GFF3 files:
>
> http://blastedbio.blogspot.com/2011/08/why-are-ncbi-gff3-files-still-broken.html
>
> Thanks,
>
> Peter



Re: Mailing list for GFF3 specification discussion?

Peter Cock
On Aug 17, 2011, at 6:13 AM, Peter Cock wrote:

>> Dear Lincoln,
>>
>> Is there a preferred mailing list for discussing the GFF3 specification?
>> I'm guessing gmod-gbrowse (CC'd) since there has been previous
>> discussion in the context of correctly formatting annotation for GBrowse.
>>
>> My interest stems from wanting to support GFF3 properly in Biopython,
>> EMBOSS, etc - and discovering even major biological data providers
>> like the NCBI don't seem to be producing valid GFF3 files:
>>
>> http://blastedbio.blogspot.com/2011/08/why-are-ncbi-gff3-files-still-broken.html
>>
>> Thanks,
>>
>> Peter

On Wed, Aug 17, 2011 at 1:34 PM, Chris Fields <[hidden email]> wrote:
> The issue with NCBI's GFF3 output has been known for a while now.
> They have been repeatedly notified about it and indicated that it was
> to be addressed at some point, but I haven't heard any updates on it.
>
> chris

Hi Chris,

Does anyone on this list have a contact at the NCBI we can contact
directly about this? I wasn't aware of any NCBI statement that they
intend to fix this.

Anyway - I wanted to know where to bring up general issues with
clarification or changes to the GFF3 specification. Maybe this info
should go in the specification itself ;)
http://www.sequenceontology.org/gff3.shtml

For instance, when you have multiple reference sequences (e.g. a
bacterial genome and two plasmids) and embed their sequences
within the GFF file using the ##FASTA directive, is the expectation
that the file goes:

* Features for reference one
* FASTA for reference one
* Features for reference two
* FASTA for reference two
* Features for reference three
* FASTA for reference three

To me at least, the current wording and example is unclear here.
Another reasonable interpretation would be:

* Features for reference one
* Features for reference two
* Features for reference three
* FASTA for reference one
* FASTA for reference two
* FASTA for reference three

I prefer the first version because then you can iterate over
the file in one pass, returning a complete reference sequence
with all its features, then move onto the next reference without
having to keep any of the previous features in memory. This
seems to be the interpretation taken by EMBOSS, e.g. this
sample file:
http://emboss.sourceforge.net/docs/themes/seqformats/gff

If that is the intent, then the example in v1.20 of the GFF3
spec could be expanded at the "..." line where the detail is
lacking. For instance, should the ##gff-version 3 header
be repeated at this point or not?

If the opposite was intended (all FASTA sequences at the
end) then this does seem to limit the utility of embedding
FASTA sequences (in my opinion), and means that EMBOSS
ought to be updated.

I've CC'd Peter Rice from EMBOSS in case he is not on this
list.

Thanks,

Peter C.


Re: Mailing list for GFF3 specification discussion?

Fields, Christopher J
On Aug 17, 2011, at 8:19 AM, Peter Cock wrote:

> On Aug 17, 2011, at 6:13 AM, Peter Cock wrote:
>>> Dear Lincoln,
>>>
>>> Is there a preferred mailing list for discussing the GFF3 specification?
>>> I'm guessing gmod-gbrowse (CC'd) since there has been previous
>>> discussion in the context of correctly formatting annotation for GBrowse.
>>>
>>> My interest stems from wanting to support GFF3 properly in Biopython,
>>> EMBOSS, etc - and discovering even major biological data providers
>>> like the NCBI don't seem to be producing valid GFF3 files:
>>>
>>> http://blastedbio.blogspot.com/2011/08/why-are-ncbi-gff3-files-still-broken.html
>>>
>>> Thanks,
>>>
>>> Peter
>
> On Wed, Aug 17, 2011 at 1:34 PM, Chris Fields <[hidden email]> wrote:
>> The issue with NCBI's GFF3 output has been known for a while now.
>> They have been repeatedly notified about it and indicated that it was
>> to be addressed at some point, but I haven't heard any updates on it.
>>
>> chris
>
> Hi Chris,
>
> Does anyone on this list have a contact at the NCBI we can contact
> directly about this? I wasn't aware of any NCBI statement that they
> intend to fix this.
>
> Anyway - I wanted to know where to bring up general issues with
> clarification or changes to the GFF3 specification. Maybe this info
> should go in the specification itself ;)
> http://www.sequenceontology.org/gff3.shtml
>
> For instance, when you have multiple reference sequences (e.g.
> bacterial genome and two plasmids) and embed their sequences
> within the GFF file using the ##FASTA is the expectation the file
> goes:
>
> * Features for reference one
> * FASTA for reference one
> * Features for reference two
> * FASTA for reference two
> * Features for reference three
> * FASTA for reference three
>
> To me at least, the current wording and example is unclear here.
> Another reasonable interpretation would be:

No, I believe the FASTA is all at the end of the file, as you have below.  Under 'Other Syntax':

##FASTA
        This notation indicates that the annotation portion of the
        file is at an end and that the remainder of the file
        contains one or more sequences (nucleotide or protein)
        in FASTA format.  This allows features and sequences to
        be bundled together.  Example:

> * Features for reference one
> * Features for reference two
> * Features for reference three
> * FASTA for reference one
> * FASTA for reference two
> * FASTA for reference three

...

> I prefer the first version because then you can iterate over
> the file in one pass, returning a complete reference sequence
> with all its features, then move onto the next reference without
> having to keep any of the previous features in memory. This
> seems to be the interpretation taken by EMBOSS, e.g. this
> sample file:
> http://emboss.sourceforge.net/docs/themes/seqformats/gff
>
> If that is the intent, then the example in v1.20 of the GFF3
> spec could be expanded at the "..." line where the detail is
> lacking. For instance, should the ##gff-version 3 header
> be repeated at this point or not?
>
> If the opposite was intended (all FASTA sequences at the
> end) then this does seem to limit the utility of embedding
> FASTA sequences (in my opinion), and means that EMBOSS
> ought to be updated.

Yes, EMBOSS should be updated.  I don't think sequence is meant to be embedded within the features; IIRC having the seqs at the end allows one to cleanly index the FASTA sequence separately from the features (I don't think very large GFF3 files were meant to be housed completely in memory), but maybe Lincoln can elaborate more on that.

> I've CC'd Peter Rice from EMBOSS in case he is not on this
> list.
>
> Thanks,
>
> Peter C.

Cool.  Hi Peter (R)!

chris



Re: Mailing list for GFF3 specification discussion?

Lincoln Stein
Hi Peter, Chris,

GMOD is a good place to discuss these issues, but you should also Cc the Sequence Ontology mailing list, which is the official venue:

http://sourceforge.net/mail/?group_id=72703

Lincoln

--
Lincoln D. Stein
Director, Informatics and Biocomputing Platform
Ontario Institute for Cancer Research
101 College St., Suite 800
Toronto, ON, Canada M5G0A3
416 673-8514
Assistant: Renata Musa <[hidden email]>


Re: Mailing list for GFF3 specification discussion?

Peter Cock
Thanks Lincoln,

I've therefore CC'd song-devel with the original content below for reference.

Thanks,

Peter C.

On Wed, Aug 17, 2011 at 3:20 PM, Lincoln Stein <[hidden email]> wrote:

> Hi Peter, Chris,
> GMOD is a good place to discuss these issues, but you should also Cc the
> Sequence Ontology mailing list, which is the official venue:
>
> http://sourceforge.net/mail/?group_id=72703
>
> Lincoln
>
> On Wed, Aug 17, 2011 at 9:44 AM, Chris Fields <[hidden email]> wrote:
>>
>> On Aug 17, 2011, at 8:19 AM, Peter Cock wrote:
>>
>> > On Aug 17, 2011, at 6:13 AM, Peter Cock wrote:
>> >>> Dear Lincoln,
>> >>>
>> >>> Is there a preferred mailing list for discussing the GFF3
>> >>> specification?
>> >>> I'm guessing gmod-gbrowse (CC'd) since there has been previous
>> >>> discussion in the context of correctly formatting annotation for
>> >>> GBrowse.
>> >>>
>> >>> My interest stems from wanting to support GFF3 properly in Biopython,
>> >>> EMBOSS, etc - and discovering even major biological data providers
>> >>> like the NCBI don't seem to be producing valid GFF3 files:
>> >>>
>> >>> http://blastedbio.blogspot.com/2011/08/why-are-ncbi-gff3-files-still-broken.html
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Peter
>> >
>> > On Wed, Aug 17, 2011 at 1:34 PM, Chris Fields <[hidden email]>
>> > wrote:
>> >> The issue with NCBI's GFF3 output has been known for a while now.
>> >> They have been repeatedly notified about it and indicated that it was
>> >> to be addressed at some point, but I haven't heard any updates on it.
>> >>
>> >> chris
>> >
>> > Hi Chris,
>> >
>> > Does anyone on this list have a contact at the NCBI we can contact
>> > directly about this? I wasn't aware of any NCBI statement that they
>> > intend to fix this.
>> >
>> > Anyway - I wanted to know where to bring up general issues with
>> > clarification or changes to the GFF3 specification. Maybe this info
>> > should go in the specification itself ;)
>> > http://www.sequenceontology.org/gff3.shtml
>> >
>> > For instance, when you have multiple reference sequences (e.g.
>> > bacterial genome and two plasmids) and embed their sequences
>> > within the GFF file using the ##FASTA is the expectation the file
>> > goes:
>> >
>> > * Features for reference one
>> > * FASTA for reference one
>> > * Features for reference two
>> > * FASTA for reference two
>> > * Features for reference three
>> > * FASTA for reference three
>> >
>> > To me at least, the current wording and example is unclear here.
>> > Another reasonable interpretation would be:
>>
>> No, I believe the FASTA is all at the end of the file, as you have below.
>>  Under 'Other Syntax':
>>
>> ##FASTA
>>        This notation indicates that the annotation portion of the
>>        file is at an end and that the remainder of the file
>>        contains one or more sequences (nucleotide or protein)
>>        in FASTA format.  This allows features and sequences to
>>        be bundled together.  Example:
>>
>> > * Features for reference one
>> > * Features for reference two
>> > * Features for reference three
>> > * FASTA for reference one
>> > * FASTA for reference two
>> > * FASTA for reference three
>>
>> ...
>>
>> > I prefer the first version because then you can iterate over
>> > the file in one pass, returning a complete reference sequence
>> > with all its features, then move onto the next reference without
>> > having to keep any of the previous features in memory. This
>> > seems to be the interpretation taken by EMBOSS, e.g. this
>> > sample file:
>> > http://emboss.sourceforge.net/docs/themes/seqformats/gff
>> >
>> > If that is the intent, then the example in v1.20 of the GFF3
>> > spec could be expanded at the "..." line where the detail is
>> > lacking. For instance, should the ##gff-version 3 header
>> > be repeated at this point or not?
>> >
>> > If the opposite was intended (all FASTA sequences at the
>> > end) then this does seem to limit the utility of embedding
>> > FASTA sequences (in my opinion), and means that EMBOSS
>> > ought to be updated.
>>
>> Yes, EMBOSS should be updated.  I don't think sequence is meant to be
>> embedded within the features; IIRC having the seqs at the end allows one to
>> cleanly index the FASTA sequence separately from the features (I don't think
>> very large GFF3 files were meant to be housed completely in memory), but
>> maybe Lincoln can elaborate more on that.
>>
>> > I've CC'd Peter Rice from EMBOSS in case he is not on this
>> > list.
>> >
>> > Thanks,
>> >
>> > Peter C.
>>
>> Cool.  Hi Peter (R)!
>>
>> chris
>>


Re: Mailing list for GFF3 specification discussion?

Peter Cock
In reply to this post by Fields, Christopher J
Are the GBrowse people still interested? Should we continue
this on song-devel only?

On Wed, Aug 17, 2011 at 2:44 PM, Chris Fields <[hidden email]> wrote:

> On Aug 17, 2011, at 8:19 AM, Peter Cock wrote:
>> Anyway - I wanted to know where to bring up general issues with
>> clarification or changes to the GFF3 specification. Maybe this info
>> should go in the specification itself ;)
>> http://www.sequenceontology.org/gff3.shtml
>>
>> For instance, when you have multiple reference sequences (e.g.
>> bacterial genome and two plasmids) and embed their sequences
>> within the GFF file using the ##FASTA is the expectation the file
>> goes:
>>
>> * Features for reference one
>> * FASTA for reference one
>> * Features for reference two
>> * FASTA for reference two
>> * Features for reference three
>> * FASTA for reference three
>>
>> To me at least, the current wording and example is unclear here.
>> Another reasonable interpretation would be:
>
> No, I believe the FASTA is all at the end of the file, as you have
> below.  Under 'Other Syntax':
>
> ##FASTA
>        This notation indicates that the annotation portion of the
>        file is at an end and that the remainder of the file
>        contains one or more sequences (nucleotide or protein)
>        in FASTA format.  This allows features and sequences to
>        be bundled together.  Example:
>
>> * Features for reference one
>> * Features for reference two
>> * Features for reference three
>> * FASTA for reference one
>> * FASTA for reference two
>> * FASTA for reference three
>
> ...
>
>> I prefer the first version because then you can iterate over
>> the file in one pass, returning a complete reference sequence
>> with all its features, then move onto the next reference without
>> having to keep any of the previous features in memory. This
>> seems to be the interpretation taken by EMBOSS, e.g. this
>> sample file:
>> http://emboss.sourceforge.net/docs/themes/seqformats/gff
>>
>> If that is the intent, then the example in v1.20 of the GFF3
>> spec could be expanded at the "..." line where the detail is
>> lacking. For instance, should the ##gff-version 3 header
>> be repeated at this point or not?
>>
>> If the opposite was intended (all FASTA sequences at the
>> end) then this does seem to limit the utility of embedding
>> FASTA sequences (in my opinion), and means that EMBOSS
>> ought to be updated.
>
> Yes, EMBOSS should be updated.  I don't think sequence is
> meant to be embedded within the features; IIRC having the
> seqs at the end allows one to cleanly index the FASTA sequence
> separately from the features (I don't think very large GFF3 files
> were meant to be housed completely in memory), but maybe
> Lincoln can elaborate more on that.

OK, so any FASTA sequences should be in a single block right
at the end of the GFF3 file.

I agree that does then allow easy FASTA indexing separately
from the features, and concatenating/splitting the FASTA
data from the features.
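
As a rough illustration of the splitting and concatenating mentioned above (a sketch, not code from this thread), the single trailing FASTA block makes both operations a matter of finding one line. The function names split_gff3 and join_gff3 are made up for this example, and the tiny input is invented.

```python
def split_gff3(text):
    """Split GFF3 text into (annotation_lines, fasta_lines) at ##FASTA."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "##FASTA":
            # Everything before the directive is annotation,
            # everything after it is FASTA.
            return lines[:i], lines[i + 1:]
    # No ##FASTA directive: a features-only file.
    return lines, []


def join_gff3(annotation_lines, fasta_lines):
    """Bundle annotation and FASTA lines back into one GFF3 text."""
    parts = list(annotation_lines)
    if fasta_lines:
        parts.append("##FASTA")
        parts.extend(fasta_lines)
    return "\n".join(parts) + "\n"


# Invented miniature file, with tabs between the nine columns.
bundled = "\n".join([
    "##gff-version 3",
    "ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001",
    "##FASTA",
    ">ctg123",
    "cttctgggcgtacccgattctcgg",
]) + "\n"

annotation, fasta = split_gff3(bundled)
```

The FASTA half can then be handed to any ordinary FASTA indexer without it ever seeing a feature line.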

I'm not convinced how useful this is - if you wanted to combine
the features with their sequence, a single pass iterator based
parser is impossible without holding all the features in memory
until the FASTA entries are reached.
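
To make the memory concern concrete, here is a minimal sketch (an illustration only, with a hypothetical iter_records function and invented data) of a parser that pairs each sequence with its features. Because the sequences only arrive after ##FASTA, every feature row has to sit in a dictionary until then.

```python
from collections import defaultdict


def iter_records(lines):
    """Yield (seqid, sequence, features) from GFF3 lines ending in a FASTA block."""
    features = defaultdict(list)  # ALL feature rows are buffered here
    it = iter(lines)
    for line in it:
        line = line.rstrip("\n")
        if line == "##FASTA":
            break  # annotation portion is over
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        features[cols[0]].append(cols)

    # The same iterator now continues with the FASTA portion.
    seqid, chunks = None, []
    for line in it:
        line = line.strip()
        if line.startswith(">"):
            if seqid is not None:
                yield seqid, "".join(chunks), features.pop(seqid, [])
            header = line[1:].split()
            seqid, chunks = (header[0] if header else ""), []
        elif line:
            chunks.append(line)
    if seqid is not None:
        yield seqid, "".join(chunks), features.pop(seqid, [])


# Invented example: one chromosome and one plasmid.
example = [
    "##gff-version 3",
    "chr\t.\tgene\t1\t8\t.\t+\t.\tID=g1",
    "plasmid\t.\tgene\t2\t4\t.\t-\t.\tID=g2",
    "##FASTA",
    ">chr",
    "acgt",
    "acgt",
    ">plasmid",
    "ttaa",
]
records = list(iter_records(example))
```

With the interleaved layout the dictionary would never hold more than one reference's features at a time; with the all-FASTA-at-the-end layout it holds the whole file's.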

Likewise a single pass conversion from GenBank/EMBL/UniProt
to GFF3 with embedded sequences also becomes impossible
(without massive memory overhead). In practice for EMBOSS
seqret doing this conversion, I think omitting the FASTA block
is the simplest solution.

Given the current EMBOSS output and my own confusion, I
think this area of the GFF3 specification could be clarified.

I believe the example would be clearer if it showed the final line
of the ctg123 sequence after the "...". Since the sequence length
is 1497228 and the wrapping is 50 characters, that would be
29944 full lines of 50bp and a final partial line of 28bp. e.g.

   ##gff-version   3
   ##sequence-region   ctg123 1 1497228
   ctg123 . gene            1000  9000  .  +  .  ID=gene00001;Name=EDEN
   ctg123 . TF_binding_site 1000  1012  .  +  .  ID=tfbs00001;Parent=gene00001
   ctg123 . mRNA            1050  9000  .  +  .  ID=mRNA00001;Parent=gene00001;Name=EDEN.1
   ctg123 . five_prime_UTR  1050  1200  .  +  .  Parent=mRNA00001
   ctg123 . CDS             1201  1500  .  +  0  ID=cds00001;Parent=mRNA00001
   ctg123 . CDS             3000  3902  .  +  0  ID=cds00001;Parent=mRNA00001
   ctg123 . CDS             5000  5500  .  +  0  ID=cds00001;Parent=mRNA00001
   ctg123 . CDS             7000  7600  .  +  0  ID=cds00001;Parent=mRNA00001
   ctg123 . three_prime_UTR 7601  9000  .  +  .  Parent=mRNA00001
   ctg123 . cDNA_match    1050  1500  5.8e-42  +  .  ID=match00001;Target=cdna0123+12+462
   ctg123 . cDNA_match    5000  5500  8.1e-43  +  .  ID=match00001;Target=cdna0123+463+963
   ctg123 . cDNA_match    7000  9000  1.4e-40  +  .  ID=match00001;Target=cdna0123+964+2964
   ##FASTA
   >ctg123
   cttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg
   tgttcattgctgcctgcatgttcattgtctacctcggctacgtgtggcta
   tctttcctcggtgccctcgtgcacggagtcgagaaaccaaagaacaaaaa
   aagaaattaaaatatttattttgctgtggtttttgatgtgtgttttttat
   aatgatttttgatgtgaccaattgtacttttcctttaaatgaaatgtaat
   cttaaatgtatttccgacgaattcgaggcctgaaaagtgtgacgccattc
   gtatttgatttgggtttactatcgaataatgagaattttcaggcttaggc
   ttaggcttaggcttaggcttaggcttaggcttaggcttaggcttaggctt
   aggcttaggcttaggcttaggcttaggcttaggcttaggcttaggcttag
   aatctagctagctatccgaaattcgaggcctgaaaagtgtgacgccattc
   ...
   acgtacgtacgtacgtacgtacgtacgt
   >cnda0123
   ttcaagtgctcagtcaatgtgattcacagtatgtcaccaaatattttggc
   agctttctcaagggatcaaaattatggatcattatggaatacctcggtgg
   aggctcagcgctcgatttaactaaaagtggaaagctggacgaaagtcata
   tcgctgtgattcttcgcgaaattttgaaaggtctcgagtatctgcatagt
   gaaagaaaaatccacagagatattaaaggagccaacgttttgttggaccg
   tcaaacagcggctgtaaaaatttgtgattatggttaaagg

Clearly I've just made up those 28bp by repeating acgt. My aim is to
make it visually more apparent that the ... just represents omitted
sequence lines. You could even state that in the text below the example.
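
For anyone wanting to check the line count quoted for ctg123, the arithmetic is a single divmod. The variable names below are just for illustration.

```python
length = 1497228  # from the ##sequence-region line for ctg123
width = 50        # bases per wrapped FASTA line in the example

full_lines, remainder = divmod(length, width)
print(full_lines, remainder)  # 29944 28
```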

Also, it wouldn't hurt to clarify that the ##gff-version 3 directive must be
present, *once and once only*, and must be the topmost line of the file.
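
If the specification did adopt that wording, a validator could enforce it with a check along these lines. This is only a sketch of the rule as proposed here, not of any existing validator, and check_version_directive is a made-up name.

```python
def check_version_directive(lines):
    """Return a list of problems with the ##gff-version directive (empty if fine)."""
    positions = [i for i, line in enumerate(lines)
                 if line.startswith("##gff-version")]
    if not positions:
        return ["missing ##gff-version directive"]

    problems = []
    if positions[0] != 0:
        problems.append("##gff-version is not the first line")
    if len(positions) > 1:
        problems.append("##gff-version appears %d times" % len(positions))

    # Accept "3" as well as dotted forms such as "3.1"; this leniency
    # is an assumption of the sketch.
    fields = lines[positions[0]].split()
    if len(fields) < 2 or fields[1].split(".")[0] != "3":
        problems.append("version is not 3")
    return problems


# A file that restarts the header for a second sequence would be flagged:
repeated = ["##gff-version 3", "chr\t.\tgene\t1\t8\t.\t+\t.\tID=g1",
            "##gff-version 3", "plasmid\t.\tgene\t2\t4\t.\t-\t.\tID=g2"]
issues = check_version_directive(repeated)
```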

Regards,

Peter C.


Re: Mailing list for GFF3 specification discussion?

Peter Rice
In reply to this post by Fields, Christopher J
On 17/08/2011 14:44, Chris Fields wrote:

> On Aug 17, 2011, at 8:19 AM, Peter Cock wrote:
>
>> For instance, when you have multiple reference sequences (e.g.
>> bacterial genome and two plasmids) and embed their sequences
>> within the GFF file using the ##FASTA is the expectation the file
>> goes:
>>
>> * Features for reference one
>> * FASTA for reference one
>> * Features for reference two
>> * FASTA for reference two
>> * Features for reference three
>> * FASTA for reference three
>>
>> To me at least, the current wording and example is unclear here.
>> Another reasonable interpretation would be:
>
> No, I believe the FASTA is all at the end of the file, as you have below.  Under 'Other Syntax':
>
> ##FASTA
> This notation indicates that the annotation portion of the
> file is at an end and that the remainder of the file
> contains one or more sequences (nucleotide or protein)
> in FASTA format.  This allows features and sequences to
> be bundled together.  Example:
>
>> * Features for reference one
>> * Features for reference two
>> * Features for reference three
>> * FASTA for reference one
>> * FASTA for reference two
>> * FASTA for reference three
>>
>> If the opposite was intended (all FASTA sequences at the
>> end) then this does seem to limit the utility of embedding
>> FASTA sequences (in my opinion), and means that EMBOSS
>> ought to be updated.
>
> Yes, EMBOSS should be updated.  I don't think sequence is meant to be embedded within the features; IIRC having the seqs at the end allows one to cleanly index the FASTA sequence separately from the features (I don't think very large GFF3 files were meant to be housed completely in memory), but maybe Lincoln can elaborate more on that.

Hmmm ... while EMBOSS can be modified to put the sequences at the end,
presumably that is with a single header. EMBOSS is "starting a new file"
by writing a new header.

I'll have a look at some multi-sequence files and probably go with the
single file, features first, sequences bunched at the end layout.

regards,

Peter Rice
EMBOSS Team


Re: Mailing list for GFF3 specification discussion?

Fields, Christopher J
On Aug 17, 2011, at 11:06 AM, Peter Rice wrote:

> On 17/08/2011 14:44, Chris Fields wrote:
>> On Aug 17, 2011, at 8:19 AM, Peter Cock wrote:
>>
>>> For instance, when you have multiple reference sequences (e.g.
>>> bacterial genome and two plasmids) and embed their sequences
>>> within the GFF file using the ##FASTA is the expectation the file
>>> goes:
>>> ...
>>> If the opposite was intended (all FASTA sequences at the
>>> end) then this does seem to limit the utility of embedding
>>> FASTA sequences (in my opinion), and means that EMBOSS
>>> ought to be updated.
>>
>> Yes, EMBOSS should be updated.  I don't think sequence is meant to be embedded within the features; IIRC having the seqs at the end allows one to cleanly index the FASTA sequence separately from the features (I don't think very large GFF3 files were meant to be housed completely in memory), but maybe Lincoln can elaborate more on that.
>
> Hmmm ... while EMBOSS can be modified to put the sequences at the end, presumably that is with a single header. EMBOSS is "starting a new file" by writing a new header.
>
> I'll have a look at some multi-sequence files and probably go with the single file, features first, sequences bunched at the end layout.
>
> regards,
>
> Peter Rice
> EMBOSS Team

Not sure how feasible it is, but maybe have the option of separating the two, with a features-only GFF3 file and a separate FASTA file for the sequences?  Most (all?) GFF3 loaders should be capable of dealing with both file types, and a lot of users don't like lumping the two together (i.e. they prefer having them separate).

chris



Re: Mailing list for GFF3 specification discussion?

Don Gilbert-2-3
In reply to this post by Peter Cock

Peter, Lincoln, et al.,

While you are making lists of incompatibilities in these genome feature formats,
add one I have to deal with, and have not heard a GFF solution for yet:
  - genes can be split across scaffolds (chromosomes, reference genome).
That can be represented properly in Genbank/EMBL/DDBJ flatfile or asn.1, but
not in GFF3. A gene or mRNA span row cannot have two references, and cannot
be entered as two rows with the same ID.
   scaffold1  x mRNA 100 500 .... ID=gene1;Note=split over scaffold1,2
   scaffold1  x exon 100 500 .... Parent=gene1
   scaffold2  x exon 200 300 .... Parent=gene1

Is there a suggested GFF representation for this model, where mRNA line captures
full extent, such as "scaffold1:100-500,scaffold2:200-300" (which is more or less
how Genbank format handles this) ?

I'm using a 2-row workaround now, with two IDs for the same gene:
  scaffold2  x mRNA 200 300 ...  ID=gene1part2
  scaffold2  x exon 200 300 .... Parent=gene1part2
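
A rough sketch of how a checker might flag this situation, i.e. a feature whose own row and child rows land on more than one reference sequence. The function names are hypothetical, and the score/strand/phase columns in the sample rows are placeholders, since they are elided in the example above.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Parse the GFF3 attributes column into a dict (no URL-unescaping here)."""
    attrs = {}
    for pair in column9.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_split_parents(lines):
    """Return {feature_id: sorted seqids} for features seen on more than one seqid."""
    seqids = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature row
        attrs = parse_attributes(cols[8])
        if "ID" in attrs:
            seqids[attrs["ID"]].add(cols[0])
        # Parent may hold a comma-separated list of IDs.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                seqids[parent].add(cols[0])
    return {fid: sorted(s) for fid, s in seqids.items() if len(s) > 1}


# Placeholder values (".", "+") fill the columns elided in the example above.
records = [
    "scaffold1\tx\tmRNA\t100\t500\t.\t+\t.\tID=gene1;Note=split over scaffold1,2",
    "scaffold1\tx\texon\t100\t500\t.\t+\t.\tParent=gene1",
    "scaffold2\tx\texon\t200\t300\t.\t+\t.\tParent=gene1",
]
split = find_split_parents(records)
```

Here split maps gene1 to both scaffolds, which is exactly the case the single mRNA row cannot describe.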

- Don Gilbert


Re: Mailing list for GFF3 specification discussion?

Peter Rice
In reply to this post by Fields, Christopher J
On 17/08/2011 17:22, Chris Fields wrote:
> Not sure how feasible it is, but maybe have the option of separating the two?  Having features-only GFF3 and a separate FASTA file for the sequences?  Most (all?) GFF3 loaders should be capable of dealing with both file types, and a lot of users don't like lumping the two together (e.g. prefer having them separate).

Already done.

If an EMBOSS application writes sequences and features (for example,
"seqret -feature") the default is to write the sequence as FASTA and the
features as GFF3.

But if you explicitly request GFF as the output format, then GFF is treated as a
sequence format and the output contains both (and can be read back in as a
sequence format).

regards,

Peter Rice
EMBOSS Team


Re: Mailing list for GFF3 specification discussion?

Jim Hu
In reply to this post by Fields, Christopher J
I think the SO list may be appropriate for discussion of the GFF3 specification.

I wonder if bioperl genbank2gff3.pl does a better/good enough job for conversion.  Nathan Liles in my group did some work on the converter a while ago, and we use it.  But as Peter points out in the blog post, it's nontrivial for a number of reasons.  Still, if it's an improvement, maybe we could not just kvetch to NCBI, but also suggest that they use our code instead. 

Peter, if nothing else, this thread has led me to follow you on twitter and tweet your blog post!

Jim Hu


=====================================
Jim Hu
Associate Professor
Dept. of Biochemistry and Biophysics
2128 TAMU
Texas A&M Univ.
College Station, TX 77843-2128
979-862-4054





Re: Mailing list for GFF3 specification discussion?

Peter Cock
On Wed, Aug 17, 2011 at 7:20 PM, Jim Hu <[hidden email]> wrote:
> I think the SO list may be appropriate for discussion of the GFF3
> specification.

Yes, we're continuing discussion about the GFF3 format there.

For anyone else looking for the SO list, it's song-devel, here,
and there is a moderation step as part of signing up:
https://lists.sourceforge.net/mailman/listinfo/song-devel

> I wonder if bioperl genbank2gff3.pl does a better/good enough job for
> conversion.  Nathan Liles in my group did some work on the converter a while
> ago, and we use it.  But as Peter points out in the blog post, it's
> nontrivial for a number of reasons.  Still, if it's an improvement, maybe we
> could not just kvetch to NCBI, but also suggest that they use our code
> instead.

Interestingly the TogoWS (a webservice in Japan) currently uses the
BioPerl converter internally - so there is precedent. I hope to fully test
how this does later (probably next week while at the BioHackathon
2011 http://2011.biohackathon.org/ in Kyoto), as well as following up
on the EMBOSS seqret conversion of GenBank to GFF3 with Peter Rice.

Presumably the BioPerl genbank2gff3.pl script is a recommended
way to load GenBank files into GBrowse?

I've previously used GenBank to BioSQL to GBrowse, avoiding the
GFF3 conversion.

> Peter, if nothing else, this thread has led me to follow you on twitter and
> tweet your blog post!
> Jim Hu

Thank you,

Peter
@pjacock on twitter


Re: Mailing list for GFF3 specification discussion?

Fields, Christopher J
In reply to this post by Jim Hu
On Aug 17, 2011, at 1:20 PM, Jim Hu wrote:

> I think the SO list may be appropriate for discussion of the GFF3 specification.
>
> I wonder if bioperl genbank2gff3.pl does a better/good enough job for conversion.  Nathan Liles in my group did some work on the converter a while ago, and we use it.  But as Peter points out in the blog post, it's nontrivial for a number of reasons.  Still, if it's an improvement, maybe we could not just kvetch to NCBI, but also suggest that they use our code instead.
>
> Peter, if nothing else, this thread has led me to follow you on twitter and tweet your blog post!

Now if I could only comment on it!  (comments seem to be broken for me)

chris

> Jim Hu
>
> On Aug 17, 2011, at 8:44 AM, Chris Fields wrote:
>
>> On Aug 17, 2011, at 8:19 AM, Peter Cock wrote:
>>
>>> On Aug 17, 2011, at 6:13 AM, Peter Cock wrote:
>>>>> Dear Lincoln,
>>>>>
>>>>> Is there a preferred mailing list for discussing the GFF3 specification?
>>>>> I'm guessing gmod-gbrowse (CC'd) since there has been previous
>>>>> discussion in the context of correctly formatting annotation for GBrowse.
>>>>>
>>>>> My interest stems from wanting to support GFF3 properly in Biopython,
>>>>> EMBOSS, etc - and discovering even major biological data providers
>>>>> like the NCBI don't seem to be producing valid GFF3 files:
>>>>>
>>>>> http://blastedbio.blogspot.com/2011/08/why-are-ncbi-gff3-files-still-broken.html
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Peter
>>>
>>> On Wed, Aug 17, 2011 at 1:34 PM, Chris Fields <[hidden email]> wrote:
>>>> The issue with NCBI's GFF3 output has been known for a while now.
>>>> They have been repeatedly notified about it and indicated that it was
>>>> to be addressed at some point, but I haven't heard any updates on it.
>>>>
>>>> chris
>>>
>>> Hi Chris,
>>>
>>> Does anyone on this list have a contact at the NCBI we can contact
>>> directly about this? I wasn't aware of any NCBI statement that they
>>> intend to fix this.
>>>
>>> Anyway - I wanted to know where to bring up general issues with
>>> clarification or changes to the GFF3 specification. Maybe this info
>>> should go in the specification itself ;)
>>> http://www.sequenceontology.org/gff3.shtml
>>>
>>> For instance, when you have multiple reference sequences (e.g.
>>> bacterial genome and two plasmids) and embed their sequences
>>> within the GFF file using the ##FASTA is the expectation the file
>>> goes:
>>>
>>> * Features for reference one
>>> * FASTA for reference one
>>> * Features for reference two
>>> * FASTA for reference two
>>> * Features for reference three
>>> * FASTA for reference three
>>>
>>> To me at least, the current wording and example is unclear here.
>>> Another reasonable interpretation would be:
>>
>> No, I believe the FASTA is all at the end of the file, as you have below.  Under 'Other Syntax':
>>
>> ##FASTA
>> This notation indicates that the annotation portion of the
>> file is at an end and that the remainder of the file
>> contains one or more sequences (nucleotide or protein)
>> in FASTA format.  This allows features and sequences to
>> be bundled together.  Example:
>>
>>> * Features for reference one
>>> * Features for reference two
>>> * Features for reference three
>>> * FASTA for reference one
>>> * FASTA for reference two
>>> * FASTA for reference three
>>
>> ...
>>
>>> I prefer the first version because then you can iterate over
>>> the file in one pass, returning a complete reference sequence
>>> with all its features, then move onto the next reference without
>>> having to keep any of the previous features in memory. This
>>> seems to be the interpretation taken by EMBOSS, e.g. this
>>> sample file:
>>> http://emboss.sourceforge.net/docs/themes/seqformats/gff
>>>
>>> If that is the intent, then the example in v1.20 of the GFF3
>>> spec could be expanded at the "..." line where the detail is
>>> lacking. For instance, should the ##gff-version 3 header
>>> be repeated at this point or not?
>>>
>>> If the opposite was intended (all FASTA sequences at the
>>> end) then this does seem to limit the utility of embedding
>>> FASTA sequences (in my opinion), and means that EMBOSS
>>> ought to be updated.
>>
>> Yes, EMBOSS should be updated.  I don't think sequence is meant to be embedded within the features; IIRC having the seqs at the end allows one to cleanly index the FASTA sequence separately from the features (I don't think very large GFF3 files were meant to be housed completely in memory), but maybe Lincoln can elaborate more on that.
>>
>>> I've CC'd Peter Rice from EMBOSS in case he is not on this
>>> list.
>>>
>>> Thanks,
>>>
>>> Peter C.
>>
>> Cool.  Hi Peter (R)!
>>
>> chris
>>
>>
>
> =====================================
> Jim Hu
> Associate Professor
> Dept. of Biochemistry and Biophysics
> 2128 TAMU
> Texas A&M Univ.
> College Station, TX 77843-2128
> 979-862-4054
>
>

Re: Mailing list for GFF3 specification discussion?

Fields, Christopher J
In reply to this post by Peter Cock
On Aug 17, 2011, at 1:39 PM, Peter Cock wrote:

> On Wed, Aug 17, 2011 at 7:20 PM, Jim Hu <[hidden email]> wrote:
>> I think the SO list may be appropriate for discussion of the GFF3
>> specification.
>
> Yes, we're continuing discussion about the GFF3 format there.
>
> For anyone else looking for the SO list, it's song-devel, here,
> and there is a moderation step as part of signing up:
> https://lists.sourceforge.net/mailman/listinfo/song-devel
>
>> I wonder if bioperl genbank2gff3.pl does a better/good enough job for
>> conversion.  Nathan Liles in my group did some work on the converter a while
>> ago, and we use it.  But as Peter points out in the blog post, it's
>> nontrivial for a number of reasons.  Still, if it's an improvement, maybe we
>> could not just kvetch to NCBI, but also suggest that they use our code
>> instead.
>
> Interestingly the TogoWS (a webservice in Japan) currently uses the
> BioPerl converter internally - so there is precedent. I hope to fully test
> how this does later (probably next week while at the BioHackathon
> 2011 http://2011.biohackathon.org/ in Kyoto), as well as following up
> on the EMBOSS seqret conversion of GenBank to GFF3 with Peter Rice.
>
> Presumably the BioPerl genbank2gff3.pl script is a recommended
> way to load GenBank files into GBrowse?

Generally, but from what I understand even that may require some additional conversion.  I have used it with some success myself.  We're rewriting some of the code for that (specifically Bio::FeatureIO), but it has stalled somewhat because of the BioPerl split.

> I've previously used GenBank to BioSQL to GBrowse, avoiding the
> GFF3 conversion.

I'm not sure how well the BioSQL adaptor is supported for GBrowse; it seems to fluctuate quite a bit.

>> Peter, if nothing else, this thread has led me to follow you on twitter and
>> tweet your blog post!
>> Jim Hu
>
> Thank you,
>
> Peter
> @pjacock on twitter

Yes, nice post!  

chris





Re: Mailing list for GFF3 specification discussion?

Scott Cain
I talked informally with someone from NCBI a few months ago.  He told
me that they were aware that their GFF3 production was flawed (I think
I may have reiterated, adding "horribly flawed" :-)  and that they had
it in the queue to fix, but it was a very, very low priority, since
they don't need it for anything, and I imagine, they don't perceive
much outside demand (imagine that, making something that is bad, and
then being surprised when nobody wants it :-)  Basically, I think
they'll get around to fixing it when Congress gives them extra money
that they never asked for that's earmarked for GFF3 production.

As for them using BioPerl to produce GFF3: good luck with that.  NCBI
is the original "not invented here" shop, at least as far as
bioinformatics goes.  It strikes me as very unlikely that they'd use
the BioPerl converter.  Also, the BioPerl converter is not perfect
anyway.  While I'm sure it would get closer to correct, I suspect the
right way for NCBI to do it is the way they'd want to anyway: by
converting asn.1 directly.

Scott





--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research


Re: Mailing list for GFF3 specification discussion?

Peter Cock
On Wed, Aug 17, 2011 at 8:20 PM, Scott Cain <[hidden email]> wrote:
> I talked informally with someone from NCBI a few months ago.  He told
> me that they were aware that their GFF3 production was flawed (I think
> I may have reiterated, adding "horribly flawed" :-)  and that they had
> it in the queue to fix, but it was a very, very low priority, since
> they don't need it for anything, and I imagine, they don't perceive
> much outside demand (imagine that, making something that is bad, and
> then being surprised when nobody wants it :-)  Basically, I think
> they'll get around to fixing it when Congress gives them extra money
> that they never asked for that's earmarked for GFF3 production.

The budget side of things is understandable - perhaps this
discussion will help demonstrate there is some interest.

> As for them using BioPerl to produce GFF3: good luck with that.
> NCBI is the original "not invented here" shop, at least as far as
> bioinformatics goes.  It strikes me as very unlikely that they'd use
> the BioPerl converter.

Well, we can ask ;)

>
> Also, the BioPerl converter is not perfect anyway.
>

I hope to review it next week.

> While I'm sure it would get closer to correct, I suspect the
> right way for NCBI to do it is the way they'd want to anyway: by
> converting asn.1 directly.
>
> Scott

Indeed - but practically no one in Bioinformatics outside
the NCBI uses asn.1 (no-one that I can think of at least)
which is awkward.

Peter


Re: Mailing list for GFF3 specification discussion?

Fields, Christopher J
In reply to this post by Scott Cain
Yes, have to agree re: NCBI's use of non-NCBI code; this practice extends to BLAST and other software (as Peter has also run into).  Unfortunately I find myself having to work around NCBI most of the time instead of with them, which for a gov-funded 'National Center for Biotechnology Information' is a bit frustrating and sad.

chris




Re: [SO-devel] Mailing list for GFF3 specification discussion?

Chris Mungall
In reply to this post by Peter Cock

It might help to disentangle the syntax of the format from the contents of the database.

Regardless of which syntax(es) NCBI provides, if they do not internally capture feature relationships or individual SO types such as introns, exons and stop codons in a consistent way, then the only way to generate these is algorithmically, using possibly-non-100%-reliable heuristics. This was definitely the situation a number of years ago when we implemented the bioperl unflattener:

        http://doc.bioperl.org/releases/bioperl-1.4/Bio/SeqFeature/Tools/Unflattener.html#DESCRIPTION

It uses a number of ad-hoc rules to detect which flavor of representation is used in the particular genbank record, and does its best to generate a full feature hierarchy, materializing exons and other features in a consistent way.
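
To make the idea of heuristic unflattening concrete, here is a toy sketch in Python. It is not the BioPerl Unflattener's algorithm: the `Feature` class, the `LEVELS` table and the single containment rule are all assumptions made up for illustration. It nests each feature under the smallest same-strand feature one level up that fully contains it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    """Hypothetical flat feature, loosely modelled on a GenBank feature table row."""
    ftype: str   # "gene", "mRNA" or "CDS"
    start: int   # 1-based, inclusive
    end: int
    strand: str  # "+" or "-"
    children: List["Feature"] = field(default_factory=list)


# Assumed nesting order for this sketch: gene > mRNA > CDS.
LEVELS = {"gene": 0, "mRNA": 1, "CDS": 2}


def contains(outer: Feature, inner: Feature) -> bool:
    """True if inner lies within outer on the same strand."""
    return (outer.strand == inner.strand
            and outer.start <= inner.start
            and inner.end <= outer.end)


def unflatten(features: List[Feature]) -> List[Feature]:
    """Attach each feature to the smallest containing feature one level up.

    Features with no plausible parent are returned as top-level roots,
    which is exactly where a purely positional heuristic becomes unreliable.
    """
    roots: List[Feature] = []
    placed: List[Feature] = []
    # Process genes first, then mRNAs, then CDSs, so parents exist before children.
    for feat in sorted(features, key=lambda f: (LEVELS[f.ftype], f.start)):
        candidates = [p for p in placed
                      if LEVELS[p.ftype] == LEVELS[feat.ftype] - 1
                      and contains(p, feat)]
        if candidates:
            parent = min(candidates, key=lambda p: p.end - p.start)
            parent.children.append(feat)
        else:
            roots.append(feat)
        placed.append(feat)
    return roots


flat = [
    Feature("CDS", 150, 400, "+"),
    Feature("gene", 100, 500, "+"),
    Feature("mRNA", 100, 500, "+"),
    Feature("gene", 600, 900, "-"),
    Feature("CDS", 650, 800, "-"),  # record has no mRNA for this gene
]
roots = unflatten(flat)
```

In this example the second CDS ends up as an orphan root because its record has no mRNA feature. A real tool needs further rules for such cases (for instance, inferring the missing mRNA), and each extra rule is another place where the guess can be wrong.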

This is used internally by the genbank2gff script, and is a source of unreliability. This extra processing step may be less of a priority if the GFF3 is intended solely to drive a genome browser display, as the average human can do a better job visually.

I don't think it's even permitted, let alone likely, that NCBI would go back and retrospectively fix submissions to use a more consistent representation, so the best that can be done is to specify a simple syntactic translation and leave it to heuristic tools in the bioperl realm to generate a normalized, consistent representation, with possibly some alternative home for GFF3-compliant data.

(by the way, I haven't really looked at how genbank stores features in a number of years, so some or all of the above may be out of date)

On Aug 17, 2011, at 1:06 PM, Peter Cock wrote:

> On Wed, Aug 17, 2011 at 8:20 PM, Scott Cain <[hidden email]> wrote:
>> I talked informally with someone from NCBI a few months ago.  He told
>> me that they were aware that their GFF3 production was flawed (I think
>> I may have reiterated, adding "horribly flawed" :-)  and that they had
>> it in the queue to fix, but it was a very, very low priority, since
>> they don't need it for anything, and I imagine, they don't perceive
>> much outside demand (imagine that, making something that is bad, and
>> then being surprised when nobody wants it :-)  Basically, I think
>> they'll get around to fixing it when Congress gives them extra money
>> that they never asked for that's earmarked for GFF3 production.
>
> The budget side of things is understandable - perhaps this
> discussion will help demonstrate there is some interest.
>
>> As for them using BioPerl to produce GFF3: good luck with that.
>> NCBI is the original "not invented here" shop, at least as far as
>> bioinformatics goes.  It strikes me as very unlikely that they'd use
>> the BioPerl converter.
>
> Well, we can ask ;)
>
>>
>> Also, the BioPerl converter is not perfect anyway.
>>
>
> I hope to review it next week.
>
>>  While I'm sure it would get closer to correct, I suspect the
>> right way for NCBI to do it is the way they'd want to anyway: by
>> converting asn.1 directly.
>>
>> Scott
>
> Indeed - but practically no one in Bioinformatics outside
> the NCBI uses asn.1 (no-one that I can think of at least)
> which is awkward.
>
> Peter
>
> _______________________________________________
> SOng-devel mailing list
> [hidden email]
> https://lists.sourceforge.net/lists/listinfo/song-devel



Re: [SO-devel] Mailing list for GFF3 specification discussion?

Peter Cock
On Wed, Aug 17, 2011 at 9:28 PM, Chris Mungall <[hidden email]> wrote:

>
> It might help to disentangle the syntax of the format from
> the contents of the database.
>
> Regardless of which syntax(es) NCBI provides, if they
> do not internally capture feature relationships or individual
> SO types such as introns, exons and stop codons in a
> consistent way, then the only way to generate these is
> algorithmically, using possibly-non-100%-reliable heuristics.
> This was definitely the situation a number of years ago
> when we implemented the bioperl unflattener:
>
>        http://doc.bioperl.org/releases/bioperl-1.4/Bio/SeqFeature/Tools/Unflattener.html#DESCRIPTION
>
> It uses a number of ad-hoc rules to detect which flavour
> of representation is used in the particular genbank record,
> and does its best to generate a full feature hierarchy,
> materializing exons and other features in a consistent way.
>
> This is used internally by the genbank2gff script, and is a
> source of unreliability. This extra processing step may be
> less of a priority if the GFF3 is intended solely to drive a
> genome browser display, as the average human can do
> a better job visually.
>
> I don't think it's even permitted let alone likely that NCBI
> would go back and retrospectively fix submissions to use
> a more consistent representation, so the best that can be
> done is to specify a simple syntactic translation and leave
> it to heuristic tools in the bioperl realm for generating a
> normalized consistent representation, with possibly some
> alternative home for GFF3-compliant data.
>
> (by the way, I haven't really looked at how genbank stores
> features in a number of years, so some or all of the above
> may be out of date)
>

Hi Chris,

I too suspect that the NCBI's internal representation does not
explicitly have parent/child relationships between genes, mRNA,
and CDS features.

However, while using them is good practice, they could be
omitted for a simplistic (but nonetheless valid) GFF3 file.
This was what I was suggesting in the short term to Peter
Rice (CC'd) for when EMBOSS seqret does GenBank to
GFF3 conversion [To my mind, there are more pressing
and easier-to-solve problems with the current seqret output].
http://emboss.open-bio.org/pipermail/emboss-dev/2011-August/000704.html
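
As a rough illustration of what such a flat file could look like, here is a minimal Python sketch that writes nine-column GFF3 lines with no Parent attributes. It is not what seqret or any existing converter does: the function names, the "example" source value and the sample features are invented for this sketch. The escaping covers only the characters the GFF3 specification reserves in column nine (; = % & , and control characters), and escaping of the seqid column is left out for brevity.

```python
def escape_attr(value: str) -> str:
    """Percent-encode characters that are reserved in GFF3 attribute values."""
    out = []
    for ch in value:
        if ch in ";=%&," or ord(ch) < 0x20 or ord(ch) == 0x7F:
            out.append("%%%02X" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)


def gff3_line(seqid, source, ftype, start, end, strand, phase=".", attrs=None):
    """Format one feature as a tab-separated GFF3 line (score is left as '.')."""
    attr_text = ";".join(
        "%s=%s" % (key, escape_attr(str(val)))
        for key, val in (attrs or {}).items()
    ) or "."
    return "\t".join(
        [seqid, source, ftype, str(start), str(end), ".", strand, phase, attr_text]
    )


def flat_gff3(records):
    """Build a flat GFF3 document: version pragma plus one line per feature."""
    lines = ["##gff-version 3"]
    lines.extend(gff3_line(**rec) for rec in records)
    return "\n".join(lines) + "\n"


# Hypothetical features; note that CDS lines need a phase (0, 1 or 2).
records = [
    dict(seqid="chr1", source="example", ftype="gene", start=100, end=500,
         strand="+", attrs={"ID": "gene1", "Note": "a;b"}),
    dict(seqid="chr1", source="example", ftype="CDS", start=150, end=400,
         strand="+", phase="0", attrs={"ID": "cds1"}),
]
text = flat_gff3(records)
```

Because no line carries a Parent attribute, a browser will show the gene and CDS as unrelated features. That is a loss of information, but it avoids guessing at relationships the source record does not state.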

Regards,

Peter
