how to input a masked assembly for annotation into Maker

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

how to input a masked assembly for annotation into Maker

Valerie Soza
Hi Maker community

I have done repeatmasking and gene prediction using the Maker pipeline on an entire genome assembly that has some scaffolds mapped to chromosomes and others that are unmapped, for a total of 11,985 scaffolds (Annotation A). I used the standard build protocol #2 as outlined by Carson at https://groups.google.com/forum/#!searchin/maker-devel/Provide$20maker_gff$20%7Csort:date/maker-devel/7nU0EaSe2ww/Hb8ARa0WBAAJ. This gave me a total of 23,559 predicted genes (21,960 genes from the default build + 1,599 genes that were rescued with Pfam domains) for the entire assembly.

Now, I want to only use scaffolds that have been mapped and ordered along chromosomes to create a pseudochromosal sequence for each chromosome that stitches together all the ordered scaffolds along a chromosome, each scaffold separated by a stretch of 100 Ns, for synteny analyses. I then want to get annotations for each of these pseudochromosal sequences. I am trying to see if I can re-do the annotations by Maker on these pseudochromosal sequences using the masked assembly produced by Maker above. I have extracted the masked sequence for ordered scaffolds for each chromosome and have used these masked sequences to create pseudochromosomal scaffolds, resulting in 13 scaffolds representing the 13 chromosomes. I tried using these masked sequences (13 scaffolds) as input for Maker to create a standard build (Annotation B), but am getting less genes predicted for these scaffolds than what I got from my entire assembly (Annotation A) above.

For all ordered scaffolds across chromosomes, I got 21,419 genes from the standard build annotation on the entire assembly (Annotation A). However, using the masked pseudochromosomal scaffolds (Annotation B), I am getting less genes predicted for the same set of scaffolds: 20,830 genes (18,465 from default build + 2,365 genes that were rescued with Pfam domains).

I am wondering if I have a setting wrong for my maker_opts.ctl files for the default and standard build runs on the masked sequences, see attached below, particularly in the repeat masking or re-annotation part of the file.
I also looked at the default build annotations in Jbrowse and compared Annotation A to Annotation B, and they looked similar except that there were transcripts from my altest= file that were not showing up in Annotation B, but were present in Annotation A; therefore, Maker did not predict some of these genes in Annotation B, but did in Annotation A. So I think Maker was missing some things in Annotation B. Is this unexpected? If so, is someone willing to check my steps and control files below to see if I did something wrong?

Or, if someone has a better suggestion for extracting coding sequence from these pseudochromosal scaffolds that were created post-Maker annotation on the entire assembly, I would welcome it. iThanks a bunch.


Annotation A default build steps:

$ maker -base Rwill10 -fix_nucleotides
$ maker -base Rwill10 -fix_nucleotides

$ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11983   11983  312159
#should be 11985

$ maker -dsindex -base Rwill10 -fix_nucleotides

$ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11985   11985  312211

$ gff3_merge -d Rwill10_master_datastore_index.log

$ grep -cP '\tgene\t' ../Rwill10.maker.output/Rwill10.all.gff
21960

$ fasta_merge -d Rwill10_master_datastore_index.log





Annotation A standard build steps:

$ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta

#genes in Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta and Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv are not in Rwill10.all.gff file
#IDs in .tsv file are called "processed-gene" from .fasta file,
#but in .gff file, I think these are called "abinit-gene"
#best thing would be to replace "processed" with "abinit" in tsv file and then grep .gff file with these IDs to create pred_gff
$ sed s/\-processed\-/\-abinit\-/g Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

#extract list of IDs only to grep for
cut -f 1 Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

$ sort Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  

$ wc Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
  1599   1599 102958 Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
 
#used create_pred_gff.sh script to grep IDs from Rwill10.all.gff file and create Rwill10.PfamA.abinit.gff

$ maker -base Rwill10standard2 -fix_nucleotides
$ maker -base Rwill10standard2 -fix_nucleotides

$ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11975   11975  311953
#should be 11985

$ maker -dsindex -base Rwill10standard2 -fix_nucleotides

$ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11985   11985  312211

$ gff3_merge -d Rwill10standard2_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10standard2.all.gff
23559

$ fasta_merge -d Rwill10standard2_master_datastore_index.log





Annotation B default build steps:

$ find Rwill10.maker.output -name 'query.masked.fasta' | sort -V -t "/" -k 5 | xargs cat > Rwill10.maker.assembly_masked.sorted.fasta

#Use perl script remove_seq_breaks.pl to remove newline characters from sequences in genome fasta file so that only 1 line of sequence follows fasta header
$ perl ../remove_seq_breaks.pl Rwill10.maker.assembly_masked.sorted.fasta > Rwill10.maker.assembly_masked.sorted.fasta.woseqbreaks

#use script to extract ordered scaffolds for each chromosome
$ ./extract_scaffolds_synteny.sh

#use script to create pseudochromosomal sequence for each chromosome
$ ./create_pseudo_chromosome_allLGs.sh

#concatenate these into one fasta file
cat *_ordered.masked.fasta2.pseudochromo > R.will10.masked.pseudochromos.fasta

$ maker -base Rwill10.pseudochromos -fix_nucleotides
$ maker -base Rwill10.pseudochromos -fix_nucleotides

$ grep FINISHED Rwill10.pseudochromos_master_datastore_index.log | sort | uniq | cut -f 1 | wc
     13      13     312

$ gff3_merge -d Rwill10.pseudochromos_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10.pseudochromos.all.gff
18465

$ fasta_merge -d Rwill10.pseudochromos_master_datastore_index.log





Annotation B standard build steps:

$ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta

$ sed s/\-processed\-/\-abinit\-/g Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

$ cut -f 1 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

$ sort Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  

$ wc Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
  2365   2365 136750 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs

#used create_pred_gff.sh script to grep IDs from Rwill10.pseudochromos.all.gff file and create Rwill10.pseudochromos.PfamA.abinit.gff

$ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides
$ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides

$ grep FINISHED Rwill10.pseudochromo.standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
     13      13     312

$ gff3_merge -d Rwill10.pseudochromo.standard2_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10.pseudochromo.standard2.all.gff
20830





-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

maker_opts.log.AnnotationA.default.log (4K) Download Attachment
maker_opts.log.AnnotationA.standard.log (4K) Download Attachment
maker_opts.log.AnnotationB.default.log (4K) Download Attachment
maker_opts.log.AnnotationB.standard.log (4K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: how to input a masked assembly for annotation into Maker

Carson Holt-2
Predictors will call partial models near the edge of contigs, but if you merger them with 100 N gaps, they will be more likely to jump the gap (attempt to merge models on each side while putting the gap in the intron - even if that is not correct), or they will just call nothing around the gap (i.e. they will behave different around gaps than at the edge of contigs).

—Carson



> On Jun 1, 2018, at 1:36 PM, Valerie Soza <[hidden email]> wrote:
>
> Hi Maker community
>
> I have done repeatmasking and gene prediction using the Maker pipeline on an entire genome assembly that has some scaffolds mapped to chromosomes and others that are unmapped, for a total of 11,985 scaffolds (Annotation A). I used the standard build protocol #2 as outlined by Carson at https://groups.google.com/forum/#!searchin/maker-devel/Provide$20maker_gff$20%7Csort:date/maker-devel/7nU0EaSe2ww/Hb8ARa0WBAAJ. This gave me a total of 23,559 predicted genes (21,960 genes from the default build + 1,599 genes that were rescued with Pfam domains) for the entire assembly.
>
> Now, I want to only use scaffolds that have been mapped and ordered along chromosomes to create a pseudochromosal sequence for each chromosome that stitches together all the ordered scaffolds along a chromosome, each scaffold separated by a stretch of 100 Ns, for synteny analyses. I then want to get annotations for each of these pseudochromosal sequences. I am trying to see if I can re-do the annotations by Maker on these pseudochromosal sequences using the masked assembly produced by Maker above. I have extracted the masked sequence for ordered scaffolds for each chromosome and have used these masked sequences to create pseudochromosomal scaffolds, resulting in 13 scaffolds representing the 13 chromosomes. I tried using these masked sequences (13 scaffolds) as input for Maker to create a standard build (Annotation B), but am getting less genes predicted for these scaffolds than what I got from my entire assembly (Annotation A) above.
>
> For all ordered scaffolds across chromosomes, I got 21,419 genes from the standard build annotation on the entire assembly (Annotation A). However, using the masked pseudochromosomal scaffolds (Annotation B), I am getting less genes predicted for the same set of scaffolds: 20,830 genes (18,465 from default build + 2,365 genes that were rescued with Pfam domains).
>
> I am wondering if I have a setting wrong for my maker_opts.ctl files for the default and standard build runs on the masked sequences, see attached below, particularly in the repeat masking or re-annotation part of the file.
> I also looked at the default build annotations in Jbrowse and compared Annotation A to Annotation B, and they looked similar except that there were transcripts from my altest= file that were not showing up in Annotation B, but were present in Annotation A; therefore, Maker did not predict some of these genes in Annotation B, but did in Annotation A. So I think Maker was missing some things in Annotation B. Is this unexpected? If so, is someone willing to check my steps and control files below to see if I did something wrong?
>
> Or, if someone has a better suggestion for extracting coding sequence from these pseudochromosal scaffolds that were created post-Maker annotation on the entire assembly, I would welcome it. iThanks a bunch.
>
>
> Annotation A default build steps:
>
> $ maker -base Rwill10 -fix_nucleotides
> $ maker -base Rwill10 -fix_nucleotides
>
> $ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
>  11983   11983  312159
> #should be 11985
>
> $ maker -dsindex -base Rwill10 -fix_nucleotides
>
> $ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
>  11985   11985  312211
>
> $ gff3_merge -d Rwill10_master_datastore_index.log
>
> $ grep -cP '\tgene\t' ../Rwill10.maker.output/Rwill10.all.gff
> 21960
>
> $ fasta_merge -d Rwill10_master_datastore_index.log
>
> <maker_opts.log.AnnotationA.default.log>
>
>
> Annotation A standard build steps:
>
> $ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta
>
> #genes in Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta and Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv are not in Rwill10.all.gff file
> #IDs in .tsv file are called "processed-gene" from .fasta file,
> #but in .gff file, I think these are called "abinit-gene"
> #best thing would be to replace "processed" with "abinit" in tsv file and then grep .gff file with these IDs to create pred_gff
> $ sed s/\-processed\-/\-abinit\-/g Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed
>
> #extract list of IDs only to grep for
> cut -f 1 Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs
>
> $ sort Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  
>
> $ wc Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
>  1599   1599 102958 Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
>
> #used create_pred_gff.sh script to grep IDs from Rwill10.all.gff file and create Rwill10.PfamA.abinit.gff
>
> $ maker -base Rwill10standard2 -fix_nucleotides
> $ maker -base Rwill10standard2 -fix_nucleotides
>
> $ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
>  11975   11975  311953
> #should be 11985
>
> $ maker -dsindex -base Rwill10standard2 -fix_nucleotides
>
> $ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
>  11985   11985  312211
>
> $ gff3_merge -d Rwill10standard2_master_datastore_index.log
>
> $ grep -cP '\tgene\t' Rwill10standard2.all.gff
> 23559
>
> $ fasta_merge -d Rwill10standard2_master_datastore_index.log
>
> <maker_opts.log.AnnotationA.standard.log>
>
>
> Annotation B default build steps:
>
> $ find Rwill10.maker.output -name 'query.masked.fasta' | sort -V -t "/" -k 5 | xargs cat > Rwill10.maker.assembly_masked.sorted.fasta
>
> #Use perl script remove_seq_breaks.pl to remove newline characters from sequences in genome fasta file so that only 1 line of sequence follows fasta header
> $ perl ../remove_seq_breaks.pl Rwill10.maker.assembly_masked.sorted.fasta > Rwill10.maker.assembly_masked.sorted.fasta.woseqbreaks
>
> #use script to extract ordered scaffolds for each chromosome
> $ ./extract_scaffolds_synteny.sh
>
> #use script to create pseudochromosomal sequence for each chromosome
> $ ./create_pseudo_chromosome_allLGs.sh
>
> #concatenate these into one fasta file
> cat *_ordered.masked.fasta2.pseudochromo > R.will10.masked.pseudochromos.fasta
>
> $ maker -base Rwill10.pseudochromos -fix_nucleotides
> $ maker -base Rwill10.pseudochromos -fix_nucleotides
>
> $ grep FINISHED Rwill10.pseudochromos_master_datastore_index.log | sort | uniq | cut -f 1 | wc
>     13      13     312
>
> $ gff3_merge -d Rwill10.pseudochromos_master_datastore_index.log
>
> $ grep -cP '\tgene\t' Rwill10.pseudochromos.all.gff
> 18465
>
> $ fasta_merge -d Rwill10.pseudochromos_master_datastore_index.log
>
> <maker_opts.log.AnnotationB.default.log>
>
>
> Annotation B standard build steps:
>
> $ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta
>
> $ sed s/\-processed\-/\-abinit\-/g Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed
>
> $ cut -f 1 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs
>
> $ sort Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  
>
> $ wc Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
>  2365   2365 136750 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
>
> #used create_pred_gff.sh script to grep IDs from Rwill10.pseudochromos.all.gff file and create Rwill10.pseudochromos.PfamA.abinit.gff
>
> $ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides
> $ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides
>
> $ grep FINISHED Rwill10.pseudochromo.standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
>     13      13     312
>
> $ gff3_merge -d Rwill10.pseudochromo.standard2_master_datastore_index.log
>
> $ grep -cP '\tgene\t' Rwill10.pseudochromo.standard2.all.gff
> 20830
>
> <maker_opts.log.AnnotationB.standard.log>
>
>
> -Valerie
>
> Valerie Soza, Ph.D.
> c/o Hall Lab
> Department of Biology
> University of Washington
> Johnson Hall 202A
> Box 351800
> Seattle, WA 98195-1800
> 206-543-6740
> http://staff.washington.edu/vsoza/
>
> _______________________________________________
> maker-devel mailing list
> [hidden email]
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org