Annotating a fragmented assembly

Annotating a fragmented assembly

Lior Glick
Hello there,

I am working on creating plant pan genomes. This means that I produce many assemblies for samples of the same species from NGS data available from SRA and then annotate them with MAKER, based on a collection of relevant evidence (transcripts and proteins).
As you might imagine, data quality is variable, so I sometimes create assemblies from >20x sequencing depth, resulting in fragmented assemblies (say, N50 in the range of 5-10 kb).
Annotation results of such genomes usually contain many partial genes, broken across contigs, so in many cases I get two proteins, representing the 3' and 5' parts of a broken gene. In other cases, only one part of the gene is detected.
I've also found that applying reference-based scaffolding (I use RaGOO) to generate pseudomolecules improves results by bringing together contigs that contain parts of the same gene, allowing MAKER to create full annotations. However, this also results in new erroneous predictions that span two contigs which are not actually adjacent in the genome but were brought together by the scaffolding process.
I suspect this has to do with the number of 'N' characters introduced as padding between ordered contigs, so one thing I wanted to ask about is how MAKER reacts to N's in the middle of a gene. Does it affect gene prediction?
I would also appreciate any advice on how to annotate fragmented genomes and comments about the strategy I described above. Please note that I am not expecting a reference-level annotation, but am simply trying to reduce noise levels towards downstream comparative analyses.
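
One way to quantify how often predictions cross scaffolding gaps is to locate the N runs in each pseudomolecule and flag gene models that contain one. The following is a minimal sketch, assuming gene coordinates have already been extracted from the MAKER GFF3 output as 0-based, half-open intervals on a single scaffold; the function names and the tuple layout are hypothetical and are not part of MAKER or RaGOO.

```python
import re
from typing import List, Tuple

Interval = Tuple[int, int]      # 0-based, half-open (start, end)
Gene = Tuple[str, int, int]     # hypothetical layout: (gene_id, start, end)


def find_n_runs(seq: str, min_len: int = 1) -> List[Interval]:
    """Return the intervals of all runs of N/n that are at least min_len long."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", seq)
        if m.end() - m.start() >= min_len
    ]


def genes_spanning_gaps(genes: List[Gene], gaps: List[Interval]) -> List[str]:
    """Return IDs of genes that fully contain at least one gap.

    A gene is flagged only when it has sequence on both sides of the gap,
    i.e. the gap lies strictly inside the gene interval.
    """
    flagged = []
    for gene_id, g_start, g_end in genes:
        if any(g_start < gap_start and gap_end < g_end
               for gap_start, gap_end in gaps):
            flagged.append(gene_id)
    return flagged


if __name__ == "__main__":
    # Toy scaffold: two contigs joined by 10 N's, then a short 2-N run.
    scaffold = "ACGT" + "N" * 10 + "ACGT" + "NN" + "AC"
    gaps = find_n_runs(scaffold)
    genes = [("g1", 0, 18), ("g2", 14, 18), ("g3", 2, 22)]
    print(gaps)
    print(genes_spanning_gaps(genes, gaps))
```

Flagged models could then be inspected by hand or filtered before downstream comparison; whether a gap-spanning model is wrong still depends on whether the two contigs are truly adjacent.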

Thanks a lot and best regards,
Lior


_______________________________________________
maker-devel mailing list
[hidden email]
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
Re: Annotating a fragmented assembly

Carson Holt-2
N’s are handled by the gene predictors themselves. I know Augustus can span N’s within introns, but I’m not sure how many N’s will cause it to split the gene. It may be a function of the expected intron length in the HMM, in which case organisms with large introns could handle more N’s. Genemark will split genes on even short runs of N’s. I’m not sure about SNAP. For BLAST alignments, extending a gap decreases the score, so how long the gap can be depends on the score of the initial seeding alignment: the larger the initial score, the longer the gap can be before the score drops below the termination threshold.
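
The BLAST point above can be illustrated with some simple arithmetic. This is only a toy model of that reasoning, not BLAST's actual extension algorithm: it assumes an affine gap cost of gap_open + k * gap_extend for a gap of length k and a fixed minimum score below which the alignment is abandoned. The function name and all parameter values are hypothetical.

```python
def max_gap_length(seed_score: int, gap_open: int, gap_extend: int,
                   min_score: int) -> int:
    """Longest gap a seed can absorb in this toy model.

    Returns the largest k such that
        seed_score - (gap_open + k * gap_extend) >= min_score,
    or 0 if even a gap of length 1 would drop the score below min_score.
    """
    if gap_extend <= 0:
        raise ValueError("gap_extend must be a positive integer")
    budget = seed_score - min_score - gap_open
    if budget < gap_extend:
        return 0
    return budget // gap_extend


if __name__ == "__main__":
    # Illustrative numbers only; they are not BLAST defaults.
    for seed in (24, 50, 100):
        print(seed, max_gap_length(seed, gap_open=5, gap_extend=2, min_score=20))
```

With these illustrative numbers, a seed scoring 100 tolerates a much longer gap than one scoring 50, and a seed scoring 24 tolerates none, which matches the qualitative statement above.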

—Carson


> On Apr 13, 2020, at 8:12 AM, Lior Glick <[hidden email]> wrote:

