Loading Ensembl GFF3 into Chado Schema


Loading Ensembl GFF3 into Chado Schema

Mike Hugo
Hello,
I'm trying to load Ensembl data into a Chado schema but am encountering an error during the load.  Here are the steps I took:

I downloaded the Ensembl Homo sapiens GTF file from the Ensembl FTP site and ran the gtf2gff3 script to convert the GTF to GFF3:
./gtf2gff3 Homo_sapiens.GRCh37.61.gtf > Homo_sapiens.GRCh37.61.gff3

Per the instructions at http://gmod.org/wiki/Load_GFF_Into_Chado I then sorted the file:
gmod_gff3_preprocessor.pl --gfffile Homo_sapiens.GRCh37.61.gff3 --outfile Homo_sapiens.GRCh37.61.sorted.gff3

I also split it into smaller chunks:
gmod_gff3_preprocessor.pl --onlysplit --splitfile 1 --gfffile Homo_sapiens.GRCh37.61.gff3.sorted

Then I tried to load a single file, for example:
gmod_bulk_load_gff3.pl --organism human  --gfffile  GL000247.1.Homo_sapiens.GRCh37.61.gff3.sorted.out.gff3

But I receive this error:

Preparing data for inserting into the chado database
(This may take a while ...)
Unable to find srcfeature GL000247.1 in the database.
Perhaps you need to rerun your data load with the '--recreate_cache' option. at /usr/lib/perl5/site_perl/5.8.8/Bio/GMOD/DB/Adapter.pm line 4555
Bio::GMOD::DB::Adapter::src_second_chance('Bio::GMOD::DB::Adapter=HASH(0xf110900)', 'Bio::SeqFeature::Annotated=HASH(0xf250c60)') called at /usr/bin/gmod_bulk_load_gff3.pl line 841

Abnormal termination, trying to clean up...

Attempting to clean up the loader temp table (so that --recreate_cache
won't be needed)...
Trying to remove the run lock (so that --remove_lock won't be needed)...
Exiting...

The contents of the GL000247.1.Homo_sapiens.GRCh37.61.gff3.sorted.out.gff3  file look like this:
GL000247.1      miRNA   gene    11986   12067   .       -       .       ID=ENSG00000238477;Name=CU442762.1;
GL000247.1      miRNA   gene    11746   11828   .       -       .       ID=ENSG00000238667;Name=CU442762.2;
GL000247.1      miRNA   gene    12803   12884   .       -       .       ID=ENSG00000240620;Name=CU442762.3;
GL000247.1      miRNA   transcript      11986   12067   .       -       .       ID=ENST00000458778;Name=CU442762.1-201;Parent=ENSG00000238477;
GL000247.1      miRNA   transcript      11746   11828   .       -       .       ID=ENST00000459271;Name=CU442762.2-201;Parent=ENSG00000238667;
GL000247.1      miRNA   transcript      12803   12884   .       -       .       ID=ENST00000498467;Name=CU442762.3-201;Parent=ENSG00000240620;
GL000247.1      miRNA   exon    11986   12067   .       -       .       ID=exon:ENST00000458778:1;Parent=ENST00000458778;
GL000247.1      miRNA   exon    11746   11828   .       -       .       ID=exon:ENST00000459271:1;Parent=ENST00000459271;
GL000247.1      miRNA   exon    12803   12884   .       -       .       ID=exon:ENST00000498467:1;Parent=ENST00000498467;

Does anyone know how I can successfully load Ensembl data into Chado?  Any help or pointers would be greatly appreciated!

Mike

------------------------------------------------------------------------------
Colocation vs. Managed Hosting
A question and answer guide to determining the best fit
for your organization - today and in the future.
http://p.sf.net/sfu/internap-sfd2d
_______________________________________________
Gmod-devel mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-devel

Re: Loading Ensembl GFF3 into Chado Schema

Scott Cain
Hi Mike,

Chado requires that the feature on which you are placing other features
(that is, the chromosome or contig; in this case, whatever GL000247.1
is) be defined first.  For this file, you could do it by adding a line
like this:

GL000247.1  ensembl  contig  1  123456   .   .   .   ID=GL000247.1;Name=GL000247.1

though if you have a lot of data like this, it might make sense to
create a single file that defines all of these reference sequences and
load it first.
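
As a rough sketch of that second approach, the Python below scans a GFF3 file and emits one reference line for every sequence ID that has no feature defining it. Several things here are assumptions rather than anything the loader prescribes: the function name reference_lines is made up for illustration, the source and type columns default to "ensembl" and "contig" as in the example line above, the input is taken to be tab-separated with nine columns, and the end coordinate is only the largest feature end seen on that sequence, which is a lower bound on the true length. Real lengths should come from the assembly itself if you have them.

```python
def reference_lines(gff_lines, source="ensembl", so_type="contig"):
    """Return one GFF3 line per seqid that no feature in the input defines.

    A seqid counts as defined when some feature on it carries ID=<seqid>.
    The end coordinate written out is the largest feature end seen on that
    seqid (an assumption: a lower bound, not the real sequence length).
    """
    max_end = {}      # seqid -> largest end coordinate seen so far
    defined = set()   # seqids that already have an ID=<seqid> feature

    for line in gff_lines:
        if line.startswith("##FASTA"):
            break                      # sequence data follows, no more features
        if not line.strip() or line.startswith("#"):
            continue                   # skip blank lines, comments, directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue                   # not a well-formed feature line
        seqid, end, attrs = cols[0], int(cols[4]), cols[8]
        max_end[seqid] = max(max_end.get(seqid, 0), end)
        if "ID=" + seqid in [a.strip() for a in attrs.split(";")]:
            defined.add(seqid)

    out = []
    for seqid in sorted(max_end):
        if seqid in defined:
            continue
        fields = [seqid, source, so_type, "1", str(max_end[seqid]),
                  ".", ".", ".", "ID=%s;Name=%s" % (seqid, seqid)]
        out.append("\t".join(fields))
    return out
```

Calling reference_lines(open("Homo_sapiens.GRCh37.61.gff3")) and writing the returned lines to their own file gives something you can load with gmod_bulk_load_gff3.pl before any of the per-contig files, so that each srcfeature already exists when its genes are loaded.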

Scott


On Wed, Mar 16, 2011 at 12:58 PM, Mike Hugo <[hidden email]> wrote:

> [...]
> Unable to find srcfeature GL000247.1 in the database.
> [...]



--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research
