loading scaffold features into chado

loading scaffold features into chado

claudia
To whom it may concern,

I have two concerns. The first is about representing scaffold
features in Chado and GBrowse. I noticed that the Sequence Ontology uses
the term supercontig, so if my assembly generated scaffolds named
"scaffold", should I change the names to supercontigs so that Chado
recognizes the terms?

Related to my first question, MAKER does not know that the contigs
are actually scaffolds/supercontigs when annotating, so MAKER will
still call the "type" (column 3 in the GFF3) a 'contig'. How can MAKER
be made to change this naming convention, either before annotation
or after?

As a result, I am having problems pulling up gene features in GBrowse
when doing a generic gene search: I must provide the MAKER-generated
unique gene ID in the GBrowse search bar, or the known sequence ID, e.g.
'scaffold001', which is not useful for someone who does not have this
information.
---- I do not have this problem when my seq_id and 'type' match, as in
the true case of 'contigs'. There I can do a generic gene search in
GBrowse with the term 'maker' and GBrowse returns all the
associated MAKER-generated gene calls.

Thank you for any guidance resolving these concerns,
Claudia



--
Claudia DiNatale
Master's Candidate
The Crosby Lab
University of Windsor
519-253-3000 ext: 4755


_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Re: loading scaffold features into chado

Carson Holt-2

>I have 2 concerns, the first is:  regarding representing scaffold
>features in chado and gbrowse. I noticed that the Sequence ontology uses
>the term supercontig and so if my assembly generated scaffolds entitled
>"scaffold" should I change the names to supercontigs so that chado
>recognizes the terms?

Yes.  You must use valid SO terms.  It is a requirement of GFF3, and Chado
will enforce this requirement on loading a GFF3 file (note Chado will even
go as far as to check the validity of the Ontology_term= attribute in GFF3
if you use it).  You can decide to use contig or supercontig as your
sequence feature.  It doesn't really matter unless you are placing both
into the database as separate features (i.e., you have a supercontig as the
parent feature and then you enter contigs individually as children of the
supercontig).
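
To see which type terms a GFF3 file actually uses before loading it, something like the following minimal Python sketch could help. It only tallies the values in column 3; it does not check them against the Sequence Ontology, and the function name count_feature_types is made up for illustration.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Tally the values found in column 3 (type) of GFF3 feature lines."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence, not features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and other directives
        cols = line.split("\t")
        if len(cols) == 9:  # a well-formed GFF3 feature line has 9 columns
            counts[cols[2]] += 1
    return counts


# Tiny made-up example; a real file would be read with open("file.gff").
sample = [
    "##gff-version 3\n",
    "scaffold001\t.\tcontig\t1\t1000\t.\t.\t.\tID=scaffold001\n",
    "scaffold001\tmaker\tgene\t10\t500\t.\t+\t.\tID=gene1\n",
]
for feature_type, n in count_feature_types(sample).items():
    print(feature_type, n)
```

Any type in the tally that is not an SO term (for example a literal "scaffold") is a candidate for renaming before the load.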


>
>Corresponding to my first question, Maker does not know that the contigs
>are actually scaffold/supercontigs when annotating and so Maker will
>still call the "type" feature or column 3 in the GFF3, a 'contig', how
>can Maker be implemented to change this naming convention before
>annotation, or after?

Not really important unless you plan on making contigs children of the
supercontig.  But you can always do a search and replace:

  cat file.gff | perl -ane 's/\tcontig\t/\tsupercontig\t/s; print $_' > new_file.gff
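
The same substitution can be sketched in Python if you want to be sure that only column 3 is touched and that any embedded ##FASTA section is left alone. This is a minimal illustration, not part of MAKER or Chado; retype_features is a hypothetical helper name.

```python
def retype_features(gff_lines, old="contig", new="supercontig"):
    """Yield GFF3 lines with column 3 changed from `old` to `new`.

    Comment/directive lines and everything from ##FASTA onward are
    passed through unchanged.
    """
    in_fasta = False
    for line in gff_lines:
        if line.startswith("##FASTA"):
            in_fasta = True
        if in_fasta or line.startswith("#"):
            yield line
            continue
        cols = line.split("\t")
        # Only rewrite well-formed feature lines whose type matches exactly.
        if len(cols) == 9 and cols[2] == old:
            cols[2] = new
            line = "\t".join(cols)
        yield line


# Made-up example line; a real run would iterate over open("file.gff").
sample = ["scaffold001\t.\tcontig\t1\t1000\t.\t.\t.\tID=scaffold001\n"]
for out_line in retype_features(sample):
    print(out_line, end="")
```

Note that this changes only the type column; the sequence IDs in column 1 (e.g. scaffold001) stay as they are.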


>
>Consequently, I am having problems pulling up gene features in Gbrowse
>when doing a generic gene search, and I must provide the maker generated
>unique-gene_id in the gbrowse search bar or the known sequence id i.e
>'scaffold001', which is not useful for someone who does not have this
>information.
>---- I do not have this problem when my seq_id, and 'type' feature id
>match in the true case of 'contigs'. I can do a generic gene search in
>gbrowse with the term 'maker' and gbrowse will provide me all the
>associated maker generated gene calls.

See "Adjusting GBrowse Name Searches" in the GBrowse tutorial -->
http://gmod.org/gbrowse2/tutorial/tutorial.html#naming


Thanks,
Carson

Re: loading scaffold features into chado

Scott Cain
Hi Claudia,

I agree with everything that Carson wrote, except about name
searching--it's a little trickier in Chado.  What you probably want to
do is implement full text searching.  See:

  http://gmod.org/wiki/Chado_Full_Text_Search

for more information on setting it up and maintaining it.

Scott





--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research


Re: loading scaffold features into chado

Carson Holt-2
Yes, thank you Scott.  My answer would work for GBrowse, NOT Chado :-)

--Carson




Re: loading scaffold features into chado

claudia
In reply to this post by Scott Cain
Hi,

  I have enabled full text searching and I still have this problem,
which is another reason for concern. So I wondered whether, if I changed
all the IDs in the GFF3 file to supercontigs, Chado would then better
link all the terms, annotations, and FASTA files. Although, I realize
that the seq_id (column 1) shouldn't need to be specific, since the
'type' term would take care of designating the feature type, no?

Claudia




Re: loading scaffold features into chado

Scott Cain
Hi Claudia,

Can you post a sample of the gff that shows what you are looking for and not finding?

Scott



Re: loading scaffold features into chado

Scott Cain
Hi Claudia,

I wanted to bring this back to the mailing lists, so I cc'ed them here.

First, with the fasta loading issue: what command are you using to
load the fasta sequences?  It works for me whether the fasta is at the
bottom of the GFF file or if it is a separate fasta file (as long as I
supply the --fastafile flag to the loader).

About the searching problem: when I turn on full text searching (which
means both running gmod_chado_fts_prep.pl and adding "-fulltext 1" to
the db_args in the gbrowse config file), I can search for "cnot1" and
find both a gene and an mRNA (of course, they are really the same
feature, but GBrowse doesn't know that).  Also, searching for "maker"
works, but in a real database, this will not be an effective query,
since the number of results returned is limited, and presumably there
will be lots of features resulting from a query like that.  Please
remind me, is that what you wanted to do?

Scott




On Wed, Mar 21, 2012 at 3:16 PM, claudia <[hidden email]> wrote:

> Hi Scott,
>
>  Wanted to give you a quick heads up that the bulk loader seems to be
> loading my fasta files now after deleting the ' ##FASTA' header ( the first
> line of the file now looks like this >scaffold0001)...
> Never had this problem before, it seems the bulk loader wanted to see a '>'
> symbol in front of the first line...
>
> -- when I say seems, I will let you know if it finishes, as it currently
>  states " Loading sequences ( if any)" ... and I never made it this far
> before :)
>
> Claudia
>
>
>
> On 21/03/2012 12:53 PM, Scott Cain wrote:
>>
>> Hi Claudia,
>>
>> I imagine one scaffold and gene models would be good--the problem is
>> finding genes, right?
>>
>> Also, with loading fasta: were the corresponding features from the GFF
>> file already loaded?  If so, that should have worked, and if it didn't
>> it implies a bug.  If not, that's why.
>>
>> Scott
>>
>>
>> On Wed, Mar 21, 2012 at 12:37 PM, claudia<[hidden email]>  wrote:
>>>
>>> Hi Scott,
>>>  So would one scaffold with Maker gene models suffice? Do you want the
>>> analysis as well?
>>>
>>> --along those same lines, I did try and load the original sequence (fasta)
>>> file first that I ran the Pipeline on and chado seems to refuse the files
>>> saying they don't contain the appropriate feature '>' in the header which in
>>> fact they do, i.e. >scaffold00001 ... So not sure what is wrong with the
>>> fasta that chado doesn't want to load even if it is embedded in the GFF3,
>>> the bulk loader or maker2chado return errors stating 'feature not found'...
>>>
>>>
>>> Claudia
>>>
>>>
>>>
>>> On 21/03/2012 12:20 PM, Scott Cain wrote:
>>>>
>>>> Hi Claudia,
>>>>
>>>> I was hoping to get actual files that I could do testing on, not
>>>> pictures of files :-)
>>>>
>>>> Scott
>>>>
>>>>
>>>> On Tue, Mar 20, 2012 at 4:15 PM, Dinatale C<[hidden email]>
>>>>  wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Attached:  I have samples of the contig file (I extracted the contig
>>>>> features first to load prior to the gene models); the fasta of the
>>>>> sequences is in the footer of the gff3 file.
>>>>>
>>>>> --so basically, based on experience with contig annotations, I should be
>>>>> able to type 'maker' into the gbrowse search bar and receive all the
>>>>> maker gene annotations, but I don't. I must specify the exact ID, i.e.
>>>>> "maker-scaffold11323-augustus-gene...." or 'scaffold11323'
>>>>>
>>>>> --so I wonder if it has to do with the fasta files being named as
>>>>> 'scaffolds' and perhaps causing a problem with chado recognizing that they
>>>>> are linked to the gene annotations due to scaffold not being a SOFA type
>>>>> term, if in fact the sequences must be submitted to the database first?
>>>>>
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Claudia
>>>>>

Re: loading scaffold features into chado

claudia
Hi Scott,
  I tried both maker2chado and gmod_bulk_load_gff.pl, with the fasta
embedded or not. If the fasta file was separate I used the --fastafile
option, and if it was embedded, it was simply --gfffile with the other
appropriate options for loading gene models, not analysis (i.e., no
exon).  In the end, as I mentioned, once I changed the
header in the fasta file, the fasta loaded fine with the bulk loader
using the --fastafile option.
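
For reference, separating an embedded sequence section from a GFF3 file so that the FASTA file starts with a '>' line (with no ##FASTA directive on top) could look like the sketch below. It is a minimal Python illustration under that assumption, not a GMOD tool; split_gff_and_fasta is a made-up name.

```python
def split_gff_and_fasta(lines):
    """Split GFF3 lines into (feature_lines, fasta_lines) at ##FASTA.

    The ##FASTA line itself is dropped, so fasta_lines begins with the
    first '>' header rather than with the directive.
    """
    features, fasta = [], []
    in_fasta = False
    for line in lines:
        if not in_fasta and line.strip().startswith("##FASTA"):
            in_fasta = True
            continue  # do not copy the directive into either output
        (fasta if in_fasta else features).append(line)
    return features, fasta


# Made-up example; a real run would read the lines from the GFF3 file.
sample = [
    "##gff-version 3\n",
    "scaffold0001\t.\tcontig\t1\t8\t.\t.\t.\tID=scaffold0001\n",
    "##FASTA\n",
    ">scaffold0001\n",
    "ACGTACGT\n",
]
gff_part, fasta_part = split_gff_and_fasta(sample)
print("".join(fasta_part), end="")
```

The two parts can then be written to separate files and handed to the bulk loader through --gfffile and --fastafile respectively.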

Next, my gene models loaded fine with the bulk loader, so the db was
populated. I turned on full text searching by running the scripts and
adding "-fulltext 1" in the conf file.
So it seems I can do as you mentioned and query specific genes, e.g. cnot1,
or specific IDs or scaffold IDs. But what I really wanted is to be
able to do generic gene searches using the 'maker' search term, as with
my other databases, so that someone without knowledge of the contents of
the database will receive gene information. This I still can't do, but
it can be done with prior dbs I set up that contained contig annotations
rather than scaffolds, and I found it very useful for returning all
maker-generated gene models.  Because of that, I thought the problem was
with the nomenclature being used across annotations, scaffolds, etc.

Claudia

On 21/03/2012 4:59 PM, Scott Cain wrote:

> Hi Claudia,
>
> I wanted to bring this back to the mailing lists, so I cc'ed them here.
>
> First, with the fasta loading issue: what command are you using to
> load the fasta sequences?  It works for me whether the fasta is at the
> bottom of the GFF file or if it is a separate fasta file (as long as I
> supply the --fastafile flag to the loader).
>
> About the searching problem: when I turn on full text searching (which
> means both running gmod_chado_fts_prep.pl and adding "-fulltext 1" to
> the db_args in the gbrowse config file), I can search for "cnot1" and
> find both a gene and an mRNA (of course, they are really the same
> feature, but GBrowse doesn't know that).  Also, searching for "maker"
> works, but in a real database, this will not be an effective query,
> since the number of results returned is limited, and presumably there
> will be lots of features resulting from a query like that.  Please
> remind me, is that what you wanted to do?
>
> Scott
>
>
>
>
> On Wed, Mar 21, 2012 at 3:16 PM, claudia <[hidden email]> wrote:
>> Hi Scott,
>>
>>   Wanted to give you a quick heads up that the bulk loader seems to be
>> loading my fasta files now after deleting the '##FASTA' header (the first
>> line of the file now looks like this: >scaffold0001)...
>> Never had this problem before; it seems the bulk loader wanted to see a '>'
>> symbol at the start of the first line...
>>
>> -- when I say seems, I will let you know if it finishes, as it currently
>> states "Loading sequences (if any)" ... and I never made it this far
>> before :)
>>
>> Claudia
>>
>>
>>
>> On 21/03/2012 12:53 PM, Scott Cain wrote:
>>> Hi Claudia,
>>>
>>> I imagine one scaffold and gene models would be good--the problem is
>>> finding genes, right?
>>>
>>> Also, with loading fasta: were the corresponding features from the GFF
>>> file already loaded?  If so, that should have worked, and if it didn't
>>> it implies a bug.  If not, that's why.
>>>
>>> Scott
>>>
>>>
>>> On Wed, Mar 21, 2012 at 12:37 PM, claudia <[hidden email]> wrote:
>>>> Hi Scott,
>>>>   So would one scaffold with Maker gene models suffice? Do you want the
>>>> analysis as well?
>>>>
>>>> --along those same lines, I did try to load the original sequence (fasta)
>>>> file first that I ran the pipeline on, and chado seems to refuse the files,
>>>> saying they don't contain the appropriate feature '>' in the header, which
>>>> in fact they do, i.e. >    scaffold00001 ... So I am not sure what is wrong
>>>> with the fasta that chado doesn't want to load; even if it is embedded in
>>>> the GFF3, the bulk loader or maker2chado return errors stating 'feature not
>>>> found'...
>>>>
>>>>
>>>> Claudia
>>>>
>>>>
>>>>
>>>> On 21/03/2012 12:20 PM, Scott Cain wrote:
>>>>> Hi Claudia,
>>>>>
>>>>> I was hoping to get actual files that I could do testing on, not
>>>>> pictures of files :-)
>>>>>
>>>>> Scott
>>>>>
>>>>>
>>>>> On Tue, Mar 20, 2012 at 4:15 PM, Dinatale C <[hidden email]> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Attached: I have samples of the contig file (I extracted the contig
>>>>>> features first, to load prior to the gene models); the fasta of the
>>>>>> sequences is in the footer of the gff3 file.
>>>>>>
>>>>>> --so basically, based on experience with contig annotations, I should be
>>>>>> able to type 'maker' into the gbrowse search bar and receive all the
>>>>>> maker gene annotations, but I don't. I must specify the exact ID, i.e.
>>>>>> "maker-scaffold11323-augustus-gene...." or 'scaffold11323'
>>>>>>
>>>>>> --so I wonder if it has to do with the fasta files being named as
>>>>>> 'scaffolds', perhaps causing a problem with chado recognizing that they
>>>>>> are linked to the gene annotations because scaffold is not a SOFA type
>>>>>> term, if in fact the sequences must be submitted to the database first?
>>>>>>
>>>>>>
>>>>>> Thanks in advance,
>>>>>>
>>>>>> Claudia
>>>>>>
>>>>>> On Tue, 20 Mar 2012 15:50:55 -0400 Scott Cain wrote:
>>>>>>> Hi Claudia,
>>>>>>>
>>>>>>> Can you post a sample of the gff that shows what you are looking for and
>>>>>>> not finding?
>>>>>>>
>>>>>>> Scott
>>>>>>>
>>>>>>>
>>>>>>> Sent from my iPad
>>>>>>>
>>>>>>> On Mar 20, 2012, at 2:03 PM, claudia wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have enabled full text searching and I still have this problem,
>>>>>>>> another reason for concern... So I wondered whether, if I changed all
>>>>>>>> the IDs in the GFF3 file to supercontigs, Chado would then better link
>>>>>>>> all the terms, annotations, and fasta files...
>>>>>>>> Although, I realize that the seq_id (column 1) shouldn't need to be
>>>>>>>> specific, since the 'type' term would take care of designating the
>>>>>>>> feature type, no?
>>>>>>>>
>>>>>>>> Claudia
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 20/03/2012 1:25 PM, Scott Cain wrote:
>>>>>>>>> Hi Claudia,
>>>>>>>>>
>>>>>>>>> I agree with everything that Carson wrote, except about name
>>>>>>>>> searching--it's a little trickier in Chado. What you probably want to
>>>>>>>>> do is implement full text searching. See:
>>>>>>>>>
>>>>>>>>> http://gmod.org/wiki/Chado_Full_Text_Search
>>>>>>>>>
>>>>>>>>> for more information on setting it up and maintaining it.
>>>>>>>>>
>>>>>>>>> Scott
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Mar 20, 2012 at 1:13 PM, Carson Holt wrote:
>>>>>>>>>>> I have 2 concerns, the first is: regarding representing scaffold
>>>>>>>>>>> features in chado and gbrowse. I noticed that the Sequence ontology uses
>>>>>>>>>>> the term supercontig and so if my assembly generated scaffolds entitled
>>>>>>>>>>> "scaffold" should I change the names to supercontigs so that chado
>>>>>>>>>>> recognizes the terms?
>>>>>>>>>> Yes. You must use valid SO terms. It is a requirement of GFF3, and Chado
>>>>>>>>>> will enforce this requirement on loading a GFF3 file (note Chado will even
>>>>>>>>>> go as far as to check the validity of the Ontology_term= attribute in GFF3
>>>>>>>>>> if you use it). You can decide to use contig or supercontig as your
>>>>>>>>>> sequence feature. It doesn't really matter unless you are placing both
>>>>>>>>>> into the database as separate features (i.e. you have a supercontig as the
>>>>>>>>>> parent feature and then you enter contigs individually as children of the
>>>>>>>>>> supercontig).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Corresponding to my first question, Maker does not know that the contigs
>>>>>>>>>>> are actually scaffold/supercontigs when annotating and so Maker will
>>>>>>>>>>> still call the "type" feature or column 3 in the GFF3, a 'contig', how
>>>>>>>>>>> can Maker be implemented to change this naming convention before
>>>>>>>>>>> annotation, or after?
>>>>>>>>>> Not really important unless you plan on making contigs children of the
>>>>>>>>>> supercontig. But you can always do a search and replace. -->
>>>>>>>>>> cat file.gff | perl -ane 's/\tcontig\t/\tsupercontig\t/s; print $_' > new_file.gff
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Consequently, I am having problems pulling up gene features in Gbrowse
>>>>>>>>>>> when doing a generic gene search, and I must provide the maker generated
>>>>>>>>>>> unique-gene_id in the gbrowse search bar or the known sequence id i.e
>>>>>>>>>>> 'scaffold001', which is not useful for someone who does not have this
>>>>>>>>>>> information.
>>>>>>>>>>> ---- I do not have this problem when my seq_id, and 'type' feature id
>>>>>>>>>>> match in the true case of 'contigs'. I can do a generic gene search in
>>>>>>>>>>> gbrowse with the term 'maker' and gbrowse will provide me all the
>>>>>>>>>>> associated maker generated gene calls.
>>>>>>>>>> See "Adjusting GBrowse Name Searches" in the GBrowse tutorial -->
>>>>>>>>>> http://gmod.org/gbrowse2/tutorial/tutorial.html#naming
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Carson
>>>>>>>>>>
>>>>>>>>>>> Thank you for any guidance resolving these concerns,
>>>>>>>>>>> Claudia
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Claudia DiNatale
>>>>>>>>>>> Master's Candidate
>>>>>>>>>>> The Crosby Lab
>>>>>>>>>>> University of Windsor
>>>>>>>>>>> 519-253-3000 ext: 4755
>>>>>>>>>>>
>>>>>>>>>>>


--
Claudia DiNatale
Master's Candidate
The Crosby Lab
University of Windsor
519-253-3000 ext: 4755

