Loading Backport

Loading Backport

Daniel Quest
All,

We are trying to get a deeper understanding of how Chado stores fully
annotated genomes, so we back-ported a recently annotated genome
(attached).  When we ran the loader, it reported:

# gmod_bulk_load_gff3.pl -a -g therJR.gff3 --noexon
...
feature found for Cthe_3237.p01, org_id:19 when trying to add sequence
at /usr/local/share/perl/5.10.0/Bio/GMOD/DB/Adapter.pm line 2527,
<GEN0> line 16700.
No feature found for Cthe_3238.p01, org_id:19 when trying to add
sequence at /usr/local/share/perl/5.10.0/Bio/GMOD/DB/Adapter.pm line
2527, <GEN0> line 16701.
Loading data into feature table ...
Loading data into featureloc table ...
Loading data into feature_relationship table ...
Loading data into featureprop table ...
Skipping feature_cvterm table since the load file is empty...
Loading data into synonym table ...
Loading data into feature_synonym table ...
Loading data into dbxref table ...
Loading data into feature_dbxref table ...
Loading data into analysisfeature table ...
Loading data into cvterm table ...
Loading data into db table ...
Skipping cv table since the load file is empty...
Loading data into analysis table ...
Skipping organism table since the load file is empty...
Adding cvtermprop=MapReferenceType for 'region' ...
Adding cvtermprop=MapReferenceType to region ...
Loading sequences (if any) ...

Done.


I am guessing it did not work.  Any ideas?
-Daniel

_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema

Cthe27405.gff.gz (2M) Download Attachment

Re: Loading Backport

Scott Cain
Hi Daniel,

I would say based on the output that all of the data in the gff file
was successfully loaded, except for some or all of the residue data.
It's not clear to me why that failed--I'll have to experiment to see
if I can reproduce the problem.

Scott

On Thursday, July 1, 2010, Daniel Quest <[hidden email]> wrote:

> [...]

--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research


Re: Loading Backport

Scott Cain
Hi Daniel,

There are several things about this GFF file that I would do
differently to get the data to load as desired:

1. Use the --noCDS option for the genbank2gff3 converter; this will
produce gene models of the form gene->mRNA->exon,polypeptide, which is
how they will be stored in Chado anyway.  This is the root of the
problem with the protein sequences not getting stored: the loader
converts CDS features to polypeptide features, and then can't find the
CDS features to attach the protein sequences to (since they don't
exist).

2. Deal with the mature_protein_region features differently.  The GFF3
loader does not handle features that are grouped by a shared ID, like
this:

CP000568  GenBank  mature_protein_region  2377873  2378196  .  -  .  ID=Cthe_1998.p01;Parent=Cthe_1998.t01;locus_tag=Cthe_1998;product=conserved hypothetical protein
CP000568  GenBank  mature_protein_region  2378602  2378625  .  -  .  ID=Cthe_1998.p01;Parent=Cthe_1998.t01;locus_tag=Cthe_1998;product=conserved hypothetical protein
CP000568  GenBank  mature_protein_region  2378197  2378601  .  -  .  ID=Cthe_1998.p01;Parent=Cthe_1998.t01;locus_tag=Cthe_1998;product=intein

There are a few problems with this representation:

a. They have the transcript as the parent, whereas I think the CDS
(or better, the polypeptide) should be.
b. They are really the same feature (which is what having the same ID
is meant to convey), but the ninth column doesn't have identical
contents.  This presents a problem for storing the data.
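
To see how widespread problem (b) is in a given file, a small check
along these lines may help.  This is only a sketch in Python, not part
of the GMOD tools: the function names are made up for illustration, and
it assumes a tab-delimited GFF3 file with key=value pairs separated by
semicolons in the ninth column.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 ninth column into a dict of key -> value."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def inconsistent_ids(gff_lines):
    """Return the IDs that appear on several lines with differing ninth columns.

    Hypothetical helper for illustration; comment lines and anything that
    does not have exactly nine tab-separated columns (for example FASTA
    sequence lines) are skipped.
    """
    seen = defaultdict(set)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        feature_id = parse_attributes(cols[8]).get("ID")
        if feature_id is not None:
            # Collect every distinct ninth column seen for this ID.
            seen[feature_id].add(cols[8].strip())
    return sorted(fid for fid, variants in seen.items() if len(variants) > 1)
```

Calling inconsistent_ids(open("therJR.gff3")) would then list the IDs
worth looking at; for the three lines above it reports Cthe_1998.p01,
because the "intein" line differs from the other two.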

To fix this set of GFF, I would remove the ID attributes from these
features altogether and make the polypeptide feature the parent.  With
that representation, the fact that the ninth column is not identical
is no longer a problem.  I don't think the general genbank2gff3 tool
should necessarily be fixed to do this automatically, but writing a
custom massaging script to run on the data after it has been produced
by the bioperl script wouldn't be too hard.
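
A massaging script of the kind described above could look roughly like
the following Python sketch.  It is not an existing GMOD or BioPerl
tool, and it rests on assumptions that should be checked against the
real converter output: that the file is tab-delimited GFF3, that each
polypeptide feature carries an ID and names its transcript through
Parent or Derives_from, and that each mature_protein_region has a
single Parent.  The function names are hypothetical.

```python
def parse_attributes(column9):
    """Split a GFF3 ninth column into an ordered dict of key -> value."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def format_attributes(attrs):
    """Join a dict of attributes back into a GFF3 ninth column."""
    return ";".join(f"{key}={value}" for key, value in attrs.items())


def reparent_mature_regions(gff_lines,
                            region_type="mature_protein_region",
                            parent_type="polypeptide"):
    """Drop the ID from region features and point Parent at the polypeptide."""
    lines = [line.rstrip("\n") for line in gff_lines]

    # First pass: map each transcript ID to the ID of its polypeptide.
    # Assumption: the polypeptide names its transcript via Parent or
    # Derives_from; adjust this if the converter output differs.
    transcript_to_polypeptide = {}
    for line in lines:
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != parent_type:
            continue
        attrs = parse_attributes(cols[8])
        transcript = attrs.get("Parent") or attrs.get("Derives_from")
        if transcript and "ID" in attrs:
            transcript_to_polypeptide[transcript] = attrs["ID"]

    # Second pass: rewrite only the region features, leave the rest as-is.
    result = []
    for line in lines:
        cols = line.split("\t")
        if not line.startswith("#") and len(cols) == 9 and cols[2] == region_type:
            attrs = parse_attributes(cols[8])
            attrs.pop("ID", None)  # the shared ID is what the loader trips on
            parent = attrs.get("Parent")
            if parent in transcript_to_polypeptide:
                attrs["Parent"] = transcript_to_polypeptide[parent]
            cols[8] = format_attributes(attrs)
            line = "\t".join(cols)
        result.append(line)
    return result
```

The two-pass design is there because a polypeptide line may appear
after the regions that belong to it.  Regions whose transcript has no
known polypeptide keep their original Parent, so it is worth checking
the output for any that were left unchanged before loading.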

Scott


On Thu, Jul 1, 2010 at 5:02 PM, Scott Cain <[hidden email]> wrote:

> [...]



