Re: maker problem

Re: maker problem

Carson Hinton Holt
GFF3 should have the assembly fasta at the bottom. That is part of the format. Please familiarize yourself with GFF3 here: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
In particular, look at the different kinds of expected features (for example, gene/mRNA/exon/CDS gene models vs. match/match_part evidence alignments).

You also need to familiarize yourself with the MAKER documentation, and perhaps follow one of the step-by-step tutorials in the MAKER wiki (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page). The 2014 tutorial has a video you can follow along with. Output files are described in the documentation and the wiki. In particular, look at the gff3_merge and fasta_merge scripts, which the wiki describes with multiple examples. Individual contigs will have results like these:
contig-dpp-500-500.gff
contig-dpp-500-500.maker.proteins.fasta
contig-dpp-500-500.maker.transcripts.fasta

The merge scripts will collect all the individual contig results into merged files. Example datasets for all of the wiki tutorials are included in the …/maker/data directory as well as the .../maker/MWAS/data/ directory (you can use them to follow along with the wiki pages).

If you follow the tutorial steps for training SNAP on a new genome and you get empty training files, then the issue is the evidence training sets you gave (example from the e-mail list archive):
https://groups.google.com/forum/#!searchin/maker-devel/maker2zff%7Csort:date/maker-devel/TculOM5oxl4/UWENIGN7EQAJ

You can also browse through the archive for more info on training SNAP and Augustus.

—Carson



On Oct 8, 2018, at 10:12 AM, Gupta, Parul <[hidden email]> wrote:

Hi Carson,
As per your suggestion, I turned on est2genome=1 and protein2genome=1, but similar results are generated. The gff of each scaffold has fasta (transcript) sequence at the end, instead of separate transcripts.fasta and protein.fasta files being generated. I don’t know how to use such gffs for further processing, such as training SNAP (for gene prediction). I need your suggestion.
Is there an option to provide trained data from Augustus (generated with standalone Augustus rather than from MAKER) instead of an Augustus species in maker_opts.ctl?

Thanks,
Parul

On Oct 4, 2018, at 6:43 PM, Gupta, Parul <[hidden email]> wrote:

Thank you Carson.

Sent from my iPad

On Oct 4, 2018, at 3:11 PM, Carson Holt <[hidden email]> wrote:

You must turn on at least one prediction method. It can be est2genome=1, protein2genome=1, or a species file to run SNAP/Augustus. The first two options are for building models to train with.

If you don’t provide a prediction method, MAKER will align evidence, but you won’t get any gene models.
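As a rough illustration of the point above, here is a minimal Python sketch that reads `key=value #comment` lines in the style of maker_opts.ctl and reports which prediction methods are switched on. `est2genome` and `protein2genome` come from this thread; `snaphmm` and `augustus_species` are assumed to be the keys for SNAP and Augustus in your MAKER version, and the sample settings are hypothetical.

```python
# Keys that switch on gene-model output in maker_opts.ctl. est2genome and
# protein2genome come from the thread; snaphmm and augustus_species are
# assumptions about the key names used for SNAP/Augustus in your MAKER version.
PREDICTOR_KEYS = ("est2genome", "protein2genome", "snaphmm", "augustus_species")


def parse_ctl(lines):
    """Return {key: value} from 'key=value #comment' lines."""
    options = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


def enabled_predictors(options):
    """List the predictor keys that are actually switched on."""
    enabled = []
    for key in PREDICTOR_KEYS:
        value = options.get(key, "")
        if key.endswith("2genome"):
            # est2genome / protein2genome are 0/1 switches.
            if value == "1":
                enabled.append(key)
        elif value:
            # The other keys take a file or species name; empty means off.
            enabled.append(key)
    return enabled


# Hypothetical first-round settings with no predictor turned on.
first_round = """
genome=masked_genome.fasta #genome sequence
est=transcripts.fasta
protein=proteins.fasta
est2genome=0
protein2genome=0
snaphmm=
augustus_species=
""".splitlines()

# An empty list means MAKER would align evidence but build no gene models.
print(enabled_predictors(parse_ctl(first_round)))
```

To check a real control file, pass an open file handle instead of the sample lines, for example `parse_ctl(open("maker_opts.ctl"))`.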

Example:

—Carson

On Oct 1, 2018, at 1:05 PM, Gupta, Parul <[hidden email]> wrote:

Hi Carson,
I am a new user of the MAKER pipeline and want to get gene predictions for a new plant genome. I used the following options in the maker_opts.ctl file for the first round:
genome=masked_genome.fasta
est=transcripts.fasta (from same species for which genome fasta is provided)
altest=transcripts.fasta (from alternative organism)
protein=proteins.fasta

The output files are only gff (no fasta); however, the gff for each scaffold has fasta sequences at the bottom. Is that the correct output?
In order to train SNAP, I used gff3_merge to concatenate all gffs from datastore_index.log to get all.gff (which also has fasta sequences). Then I ran maker2zff on all.gff, and it generated zero-size files (genome.ann and genome.dna). I am wondering whether I made a mistake or did not provide all the input files. For repeat masking I used RepeatMasker separately from the MAKER pipeline. My datastore_index.log file shows many “RETRY” and “FAILED” scaffolds.
FYI, I subscribed to the “maker-devel” Google group, but the “new topic” button is greyed out.

Your suggestions?

Thanks in advance.

Parul





_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Re: maker problem

Gupta, Parul
Alright, I have gone through all those tutorials. But my question is: why is MAKER generating only gff as output? There is neither transcripts.fasta nor proteins.fasta in the output directory, so I can only use gff3_merge and not fasta_merge, because there are no fasta files. This happened for all scaffolds.

Below are the datastore_index.log entries for one scaffold:

ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ STARTED
ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ FINISHED
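Since the first message mentions many "RETRY" and "FAILED" scaffolds, a quick tally of the last status recorded per sequence can show how widespread the problem is. The following is a minimal Python sketch, assuming each log line has three whitespace-separated columns (name, output directory, status) as in the two lines above; the function name is made up for illustration.

```python
from collections import Counter


def final_status(lines):
    """Map each sequence name to the last status recorded for it."""
    status = {}
    for line in lines:
        # Split from the right so the status and directory are always the
        # last two columns, whatever characters the name contains.
        parts = line.rsplit(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, _directory, state = parts
        status[name] = state  # later lines overwrite earlier ones
    return status


# The two entries shown in this message.
log = [
    "ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ STARTED",
    "ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ FINISHED",
]

# Count how many sequences ended in each state.
print(Counter(final_status(log).values()))
```

For a full run, pass the open datastore index log file in place of `log`.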


Output directory of that scaffold looks like:

[Linux@waterman ScJhAqd_1%3BHRSCAF=2]$ ll
total 160
drwxr-xr-x 3 guptapa pi     3 Oct  5 15:51 ../
-rw-r--r-- 1 guptapa pi 27740 Oct  5 15:51 run.log
-rw-r--r-- 1 guptapa pi 34268 Oct  5 15:51 ScJhAqd_1%3BHRSCAF=2.gff
drwxr-xr-x 2 guptapa pi    75 Oct  5 15:51 theVoid.ScJhAqd_1%3BHRSCAF=2/
drwxr-xr-x 3 guptapa pi     5 Oct  5 15:51 ./


gff looks like:

[Linux@waterman ScJhAqd_1%3BHRSCAF=2]$ head ScJhAqd_1%3BHRSCAF=2.gff
##gff-version 3
ScJhAqd_1%3BHRSCAF%3D2 . contig 1 2578 . . . ID=ScJhAqd_1%3BHRSCAF%3D2;Name=ScJhAqd_1%3BHRSCAF%3D2;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:0;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;Target=Mlong585_29391-RA 132 212;Gap=M81;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1785 2578 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2477 2578 112 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:1;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 28 61;Gap=M34;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1785 2042 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:2;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 154 239;Gap=M86;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1806 2578 128 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2471 2578 132 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:3;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 117 152;Gap=M36;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2299 2379 89 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:4;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 153 179;Gap=M27;

Regards,
Parul




Re: maker problem

Carson Hinton Holt
Look at the GFF3, particularly the gene/mRNA/exon/CDS vs. match/match_part features (GFF3 spec).

Does your GFF3 contain gene/mRNA/exon/CDS entries? If not, then your GFF3 has no models (it’s empty even if it does contain match/match_part entries). This means either 1. no predictor was set during the run (i.e. neither est2genome=1 nor protein2genome=1 was set) or 2. the evidence alignments or the assembly are so poor that no models can be made. Look at the results in a browser. Compare what you see on one of your contigs to what you get when running an example from the tutorials.
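One way to answer that question without a browser is to count the feature types in column 3 of the GFF3. This is a minimal Python sketch, assuming tab-separated GFF3 lines and an embedded `##FASTA` section at the end; the function names and the sample lines are illustrative only.

```python
from collections import Counter

# Feature types that make up a gene model, as listed above.
MODEL_TYPES = {"gene", "mRNA", "exon", "CDS"}


def count_feature_types(lines):
    """Count column-3 feature types, stopping at the embedded ##FASTA section."""
    counts = Counter()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # everything after this is sequence, not features
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        columns = line.split("\t")
        if len(columns) >= 3:
            counts[columns[2]] += 1
    return counts


def has_gene_models(counts):
    """True if at least one gene/mRNA/exon/CDS feature was seen."""
    return any(counts[feature] > 0 for feature in MODEL_TYPES)


# Made-up lines with evidence alignments only, similar in shape to the gff above.
sample = [
    "##gff-version 3",
    "\t".join(["scaffold_1", ".", "contig", "1", "2578", ".", ".", ".", "ID=scaffold_1"]),
    "\t".join(["scaffold_1", "blastx", "protein_match", "1782", "2024", "175", "-", ".", "ID=hit0"]),
    "\t".join(["scaffold_1", "blastx", "match_part", "1782", "2024", "175", "-", ".", "ID=hsp0;Parent=hit0"]),
    "##FASTA",
    ">scaffold_1",
    "ACGT",
]

counts = count_feature_types(sample)
print(dict(counts), "gene models present:", has_gene_models(counts))
```

On a merged file, the same functions can be run over an open handle, for example `count_feature_types(open("all.gff"))`.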

Perhaps you provided unassembled mRNA-seq data (MAKER does not process raw mRNA-seq; it must be assembled first). Perhaps you did not provide a broad protein dataset (UniProt/Swiss-Prot is usually a good one to use, for example). Or perhaps your assembly is too fragmented and has too many runs of NNNNNN to generate matching ORFs against evidence alignments (look at the results in a browser).
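To get a rough number for the "runs of NNNNNN" concern, the fraction of N bases and the longest N run can be computed per sequence. The sketch below is only an illustration in Python with made-up function names and toy sequences; what counts as "too many" is not defined in this thread and is left to the reader.

```python
import re


def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n_run_stats(sequence):
    """Return (fraction of N bases, length of the longest N run)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0, 0
    runs = [len(match.group()) for match in re.finditer(r"N+", sequence)]
    return sum(runs) / len(sequence), max(runs, default=0)


# Toy input; a real check would read the assembly FASTA given as genome=.
toy = [">scaffold_a", "ACGTNNNN", "acgtnn", ">scaffold_b", "ACGTACGT"]
for name, sequence in read_fasta(toy):
    fraction, longest = n_run_stats(sequence)
    print(name, round(fraction, 3), longest)
```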

—Carson




Re: maker problem

Carson Hinton Holt
Also run BUSCO on your assembly. It will give you an estimate of how complete/incomplete your genome assembly is. Also make sure you are running on a genome assembly and not a transcriptome assembly (MAKER does not annotate transcriptomes).

—Carson



Re: maker problem

Gupta, Parul
OK, let me explain my case.
Genome- eukaryote
We ran BUSCO and found no problem with the genome assembly. I used RepeatMasker (separately from the MAKER pipeline) to mask repeats using a custom-generated library (de novo repeats plus a repeat library from other species). The masked genome was used as the genome input in maker_opts.ctl.

Transcripts-
We have RNA-Seq data from the same species as the sequenced genome, assembled using Velvet/Oases. I globally aligned the transcripts to the assembled genome using GMAP, which gave ~99% mapping. The GFF3 generated by GMAP was also checked in a genome browser. Those transcripts were used as the est input in maker_opts.ctl. These assembled transcripts may be redundant.

Proteins-
I used protein sequences (fasta) downloaded from UniProt for 5 closely related species, plus one set from an in-house sequenced genome (already published). The protein sequences from all 6 organisms were concatenated into one file and used as protein evidence in maker_opts.ctl.

altest=transcripts.fasta (from in-house sequenced genome (already published))

est2genome=1
protein2genome=1

Sorry for not explaining my case initially. What other files can I use as est evidence? Can I use Augustus-generated hints for gene prediction along with the above options?
Your thoughts?

Parul



On Oct 8, 2018, at 1:08 PM, Carson Hinton Holt <[hidden email]> wrote:

Also run BUSCO on your assembly. It will give you an estimate of how complete/incomplete your genome assembly is. Also make sure you are running on a genome assembly and not a transcriptome assembly (MAKER does not annotate transcriptomes).

—Carson


On Oct 8, 2018, at 11:45 AM, Carson Holt <carson.holt@genetics..utah.edu> wrote:

Look at the GFF3 particularly gene/mRNA/exon/CDS vs match/match_part features (GFF3 spec).

Does your GFF3 contain gene/mRNA/exon/CDS entries? If not, then your GFF3 has no models (it’s empty even if it does contain match/match_part entries). This means either .1. no predictor was set during the run (i.e. est2genome=1 or protein2genome=1 not set) or 2. evidence alignments or assembly are so poor that no models can be made. Look at the results in a browser. Compare what you see on one of your contigs to what you get when running an example from the tutorials.

Perhaps you provided unassembled mRNA-seq data (maker does not process raw mRNA-seq, it must be assembled first). Perhaps you did not provide a broad protein dataset (UniProt/Swiss-prot is usually a good one to use for example). Or perhaps your assembly is too fragmented and has too many runs of NNNNNN to generate matching ORFs against evidence alignments (look at results in a browser).

—Carson



On Oct 8, 2018, at 11:31 AM, Gupta, Parul <[hidden email]> wrote:

Alright, I had gone through all those tutorials. But my question is - why maker generating only gff as an output ? there is neither transcripts.fasta nor proteins.fasta in output directory. So I can only use gff3_merge but not fasta_merge because there is no fasta files. This happened to all scaffolds. 

Below is the example of my datastore_index.log file for that scaffold :

ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ STARTED
ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ FINISHED


Output directory of that scaffold looks like:

[Linux@waterman ScJhAqd_1%3BHRSCAF=2]$ ll
total 160
drwxr-xr-x 3 guptapa pi     3 Oct  5 15:51 ../
-rw-r--r-- 1 guptapa pi 27740 Oct  5 15:51 run.log
-rw-r--r-- 1 guptapa pi 34268 Oct  5 15:51 ScJhAqd_1%3BHRSCAF=2.gff
drwxr-xr-x 2 guptapa pi    75 Oct  5 15:51 theVoid.ScJhAqd_1%3BHRSCAF=2/
drwxr-xr-x 3 guptapa pi     5 Oct  5 15:51 ./


gff looks like:

Linux@waterman ScJhAqd_1%3BHRSCAF=2]$ head ScJhAqd_1%3BHRSCAF=2.gff 
##gff-version 3
ScJhAqd_1%3BHRSCAF%3D2 . contig 1 2578 . . . ID=ScJhAqd_1%3BHRSCAF%3D2;Name=ScJhAqd_1%3BHRSCAF%3D2;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:0;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;Target=Mlong585_29391-RA 132 212;Gap=M81;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1785 2578 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2477 2578 112 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:1;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 28 61;Gap=M34;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1785 2042 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:2;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 154 239;Gap=M86;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1806 2578 128 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2471 2578 132 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:3;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 117 152;Gap=M36;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2299 2379 89 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:4;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 153 179;Gap=M27;

Regards,
Parul



On Oct 8, 2018, at 11:34 AM, Carson Hinton Holt <[hidden email]> wrote:

GFF3 should have the assembly fasta at the bottom. That is part of the format. Please familiarize yourself with GFF3 here —> https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
Particularly look at the different kinds of expected features (example gene/mRNA/exon/CDS gene models vs match/match_part evidence alignments).

Also you need to familiarize yourself with the MAKER documentation, and perhaps follow one of the step by step tutorials in the MAKER wiki (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page). The 2014 tutorial has a video you can follow along with. Output files are described in the documentation and the wiki. Particularly look at the necessary gff3_merge and fasta_merge scripts described in the wiki with multiple examples. Individual contigs will have results like so —>
contig-dpp-500-500.gff
contig-dpp-500-500.maker.proteins.fasta
contig-dpp-500-500.maker.transcripts.fasta

The merge scripts will collect all the individual contig results into merged files. Example datasets for all of the wiki tutorials are included in the …/maker/data directory as well as the .../maker/MWAS/data/ directory (you can use them to follow along with the wiki pages).

If you follow the tutorial steps for training SNAP on a new genome and you get empty training files, then the issue is the evidence training sets you gave (example from the e-mail list archive):
https://groups.google.com/forum/#!searchin/maker-devel/maker2zff%7Csort:date/maker-devel/TculOM5oxl4/UWENIGN7EQAJ

You can also browse through the archive for more info on training SNAP and Augustus.

—Carson



On Oct 8, 2018, at 10:12 AM, Gupta, Parul <[hidden email]> wrote:

Hi Carson,
As per your suggestion, I turned on est2genome=1 and protein2genome=1, but similar results were generated. The GFF of each scaffold has FASTA (transcripts) sequence at the end instead of separate transcripts.fasta and protein.fasta files being generated. I don’t know how to use such GFFs for further processing, such as training SNAP (for gene prediction). I need your suggestion.
Is there an option to provide trained data from Augustus (generated with standalone Augustus rather than through MAKER) instead of the Augustus species in maker_opts.ctl?

Thanks,
Parul

On Oct 4, 2018, at 6:43 PM, Gupta, Parul <[hidden email]> wrote:

Thank you Carson.

Sent from my iPad

On Oct 4, 2018, at 3:11 PM, Carson Holt <[hidden email]> wrote:

You must turn on at least 1 prediction method. It can be est2genome=1, protein2genome=1, or a species file to run SNAP/Augustus. The first two options are for building models to train with.

If you don’t provide a prediction method, MAKER will align evidence, but you won’t get any gene models.
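
One way to check a control file for this condition before launching a run is sketched below in Python. This is only an illustration, not part of MAKER: it assumes the control file uses `key=value #comment` lines, and it assumes the SNAP and Augustus settings are named `snaphmm` and `augustus_species`. The file contents shown are made up to resemble the first-round setup described in this thread.

```python
def parse_ctl(text):
    """Parse 'key=value #comment' lines from a control file into a dict."""
    opts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        opts[key.strip()] = value.strip()
    return opts


def prediction_methods(opts):
    """List the prediction-related options that are switched on."""
    enabled = [k for k in ("est2genome", "protein2genome") if opts.get(k) == "1"]
    # Option names below are assumptions about the control file layout.
    enabled += [k for k in ("snaphmm", "augustus_species") if opts.get(k)]
    return enabled


# Hypothetical first-round control file: evidence only, no prediction method.
first_round = """
genome=masked_genome.fasta
est=transcripts.fasta
protein=proteins.fasta
est2genome=0 #hypothetical comment text
protein2genome=0
snaphmm=
augustus_species=
"""

methods = prediction_methods(parse_ctl(first_round))
if not methods:
    print("No prediction method enabled: expect evidence alignments only.")
```

With `est2genome=1` in the text above, `prediction_methods` would return `["est2genome"]` instead of an empty list.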

Example:

—Carson

On Oct 1, 2018, at 1:05 PM, Gupta, Parul <[hidden email]> wrote:

Hi Carson,
I am a new user of the MAKER pipeline and want to get gene predictions for a new plant genome. I used the following options in the maker_opts.ctl file for the first round:
genome=masked_genome.fasta
est=transcripts.fasta (from same species for which genome fasta is provided)
altest=transcripts.fasta (from alternative organism)
protein=proteins.fasta

The output files are only GFF (no FASTA); however, the GFF for each scaffold has FASTA sequences at the bottom. Is that the correct output?
In order to train SNAP, I used gff3_merge to concatenate all GFFs from datastore_index.log into all.gff (which also has FASTA sequences). Then all.gff was used for maker2zff, which generated zero-size files (genome.ann and genome.dna). I am wondering whether I made a mistake or did not provide all the input files. For repeat masking I used RepeatMasker separately from the MAKER pipeline. My datastore_index.log file shows many “RETRY” and “FAILED” scaffolds.
FYI, I subscribed to the "maker-devel" Google group, but the "new topic" button is greyed out.

Any suggestions?

Thanks in advance.

Parul









_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Re: maker problem

Carson Holt-2

> We ran BUSCO and there is no problem with the genome assembly. I used RepeatMasker (separately from the MAKER pipeline) to mask the repeats using a custom-generated library (de novo repeats plus a repeat library from other species). The masked genome was used as input in maker_opts.ctl.

Let MAKER run masking if possible. Also BUSCO can be used to train Augustus which can then become the gene predictor in MAKER.


> Transcripts-
> We have RNA-Seq data assembled using Velvet/Oases from the same species as the sequenced genome. I globally aligned the transcripts to the assembled genome using GMAP, which gave ~99% mapping. The GFF3 generated from GMAP was also checked in a genome browser. Those transcripts were used as est input in maker_opts.ctl. These assembled transcripts may have redundancy.

est2genome doesn't work with est_gff. You must provide fasta of assembled transcripts. You can revert back to the GFF3 if you want after training.


> Proteins-
> I used protein sequences (FASTA) downloaded from UniProt for 5 closely related species and one set from an in-house sequenced genome (already published). Protein sequences from all 6 organisms are concatenated in one file and used as protein evidence in maker_opts.ctl.

Look at the contigs in a browser. Find a contig with protein2genome results in the GFF3 (i.e. the source column is marked protein2genome in the GFF3), and look at it specifically. If you don’t find any, then the issue is either your pre-masking or the evidence proteins you gave. I’d recommend using UniProt/Swiss-Prot, which contains a broad set of curated and conserved proteins.
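
To find candidate contigs without opening every file in a browser, the source column can be scanned directly. The following is a minimal Python sketch, assuming standard tab-separated 9-column GFF3 with the embedded sequence introduced by a `##FASTA` line; the helper name `contigs_with_source` is made up for illustration, and the two feature lines are taken from this thread.

```python
def contigs_with_source(gff_lines, source):
    """Return the set of seqids that have at least one feature from `source`."""
    hits = set()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[1] == source:
            hits.add(cols[0])
    return hits


gff = [
    "##gff-version 3",
    "ScJhAqd_2184%3BHRSCAF%3D3164\tprotein2genome\tprotein_match\t31566\t32621\t1426\t+\t.\t"
    "ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;",
    "ScJhAqd_1%3BHRSCAF%3D2\tblastx\tprotein_match\t1782\t2024\t175\t-\t.\t"
    "ID=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;",
    "##FASTA",
    ">ScJhAqd_1%3BHRSCAF%3D2",
]

protein2genome_contigs = contigs_with_source(gff, "protein2genome")
print(sorted(protein2genome_contigs))
```

On a merged GFF3, the same function can be fed an open file handle instead of a list.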


> altest=transcripts.fasta (from in-house sequenced genome (already published))

These will be ignored until you have a trained HMM (this type of alignment can only be used as hints to the trained predictor).

—Carson



Re: maker problem

Gupta, Parul
I used Augustus to generate a training set (separately from MAKER) based on transcripts (FASTA), so how can I use that Augustus-generated trained data (hints in GFF3 format) in MAKER for gene prediction? I can only see the Augustus species option in maker_opts.ctl. Which option do I need to turn on in maker_opts.ctl to supply the Augustus-generated hints file? I have augustus.gff as predicted hints.

> est2genome doesn't work with est_gff. You must provide fasta of assembled transcripts. You can revert back to the GFF3 if you want after training.

I used the est FASTA, not est_gff.

> Find a contig with protein2genome results in the GFF3

Yes, I can see protein2genome results in the GFF3:

ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome protein_match 31566 32621 1426 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 31566 31775 1426 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532540;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;Target=Mlong585_07911-RA 82 154;Gap=M14 I3 M56;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 31872 32621 1426 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532541;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;Target=Mlong585_07911-RA 155 409;Gap=M126 I5 M124;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome protein_match 33816 35829 1394 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 34916 35829 1394 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532542;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;Target=Mlong585_12901-RA 41 343;Gap=M27 D1 M276 F2;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 33816 34182 1394 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532543;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;Target=Mlong585_12901-RA 344 466;Gap=R2 M123;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome protein_match 49636 51466 1091 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446675;Name=Mlong585_07901-RA;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 51354 51466 1091 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532544;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446675;Name=Mlong585_07901-RA;Target=Mlong585_07901-RA 1 36;Gap=M20 D1 M16 F2;

And est2genome results in the GFF3 as well:

ScJhAqd_2184%3BHRSCAF%3D3164 est2genome expressed_sequence_match 48887305 48890708 16239 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48887305 48889881 16239 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871792;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;Target=Sh_Salba_v2_61181 1 2590 +;Gap=M285 D1 M288 I10 M5 I4 M1998;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48889982 48890708 16239 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871793;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;Target=Sh_Salba_v2_61181 2591 3317 +;Gap=M727;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome expressed_sequence_match 48887305 48890708 16412 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48887305 48889881 16412 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871794;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;Target=Sh_Salba_v2_61182 1 2590 +;Gap=M285 D1 M288 I10 M5 I4 M1998;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48889949 48890708 16412 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871795;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;Target=Sh_Salba_v2_61182 2591 3350 +;Gap=M760;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome expressed_sequence_match 48895479 48899036 9582 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547165;Name=Sh_Salba_v2_108280;

Thanks,
Parul


Re: maker problem

Carson Holt-2
Once Augustus is trained, it will have a new species directory under …/augustus/config/species/ for the organism you just trained. If you trained Augustus elsewhere (website, BUSCO, etc.), you have to copy the species data there. Then you just supply the species name and Augustus automatically finds it (see the Augustus documentation on training).
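
Before putting a species name into maker_opts.ctl, it can help to confirm that the directory really exists where Augustus will look. The following is a minimal Python sketch, assuming the config directory is given by the `AUGUSTUS_CONFIG_PATH` environment variable (or passed in explicitly); the species name `my_plant` is hypothetical, and the demonstration uses a temporary directory rather than a real Augustus install.

```python
import os
import tempfile
from pathlib import Path


def augustus_species_dir(species, config_path=None):
    """Return the species directory if it exists under <config>/species/, else None."""
    config_path = config_path or os.environ.get("AUGUSTUS_CONFIG_PATH")
    if not config_path:
        return None  # no config location known
    candidate = Path(config_path) / "species" / species
    return candidate if candidate.is_dir() else None


# Demonstration against a throwaway directory standing in for the config dir.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "species" / "my_plant").mkdir(parents=True)  # hypothetical species
    found = augustus_species_dir("my_plant", tmp) is not None
    missing = augustus_species_dir("not_trained", tmp) is None

print(found, missing)
```

If the function returns None for a species trained elsewhere, the species data still needs to be copied into place as described above.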

For est2genome=1 and protein2genome=1, MAKER takes the alignments from exonerate protein2genome and est2genome and, if they are mostly open reading frame, turns them directly into gene/mRNA/exon/CDS models. If there are none of those in the resulting GFF3 but there are est2genome and protein2genome alignments, then all of the alignments have broken ORFs. That means there are serious issues with your assembly, or with the EST FASTA or protein FASTA file. For a protein FASTA, I recommend using UniProt/Swiss-Prot because it is manually curated and contains a broad dataset. But if you cannot get gene models from UniProt/Swiss-Prot protein2genome alignments, then your assembly has issues (either too fragmented, lots of errors inducing random stop codons, or lots of N’s interspersed in the sequence).
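
The distinction between evidence alignments and gene models can be checked by tallying the source and type columns of a GFF3. The following is a minimal Python sketch, assuming tab-separated GFF3 with any embedded sequence after a `##FASTA` line; the function names are made up for illustration, and the sample lines are evidence alignments quoted earlier in this thread.

```python
from collections import Counter

MODEL_TYPES = {"gene", "mRNA", "exon", "CDS"}


def tally_features(gff_lines):
    """Count features by (source, type), ignoring directives and embedded FASTA."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # everything after this is sequence, not features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.split("\t")
        if len(cols) >= 3:
            counts[(cols[1], cols[2])] += 1
    return counts


def has_gene_models(counts):
    """True if any gene/mRNA/exon/CDS feature was seen."""
    return any(ftype in MODEL_TYPES for (_source, ftype) in counts)


evidence_only = [
    "##gff-version 3",
    "ScJhAqd_1%3BHRSCAF%3D2\t.\tcontig\t1\t2578\t.\t.\t.\tID=ScJhAqd_1%3BHRSCAF%3D2;",
    "ScJhAqd_1%3BHRSCAF%3D2\tblastx\tprotein_match\t1782\t2024\t175\t-\t.\tID=hit0;",
    "ScJhAqd_1%3BHRSCAF%3D2\tblastx\tmatch_part\t1782\t2024\t175\t-\t.\tID=hsp0;Parent=hit0;",
]

counts = tally_features(evidence_only)
for (source, ftype), n in sorted(counts.items()):
    print(source, ftype, n)
print("gene models present:", has_gene_models(counts))
```

A file that tallies only match and match_part types, like the sample here, contains alignments but no models, which is the situation in which maker2zff has nothing to write.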

—Carson


