Missing genes in lift-over with est2genome

Lior Glick
Hello,
I am using MAKER to annotate a plant genome assembly. A high-quality reference genome and annotation exist for another variety of the same species, so my first step is to lift over the reference genes to my genome. I do this by setting est2genome=1 and providing MAKER with the reference cDNA (transcriptome). No other evidence is provided and no gene prediction is performed. Repeat masking is done using the reference repeat library.
When checking the results, I found that many reference genes are missing from the lift-over output. However, if I BLAST the sequences of these genes myself, I get good matches. I even see these matches when I look at the BLAST results buried in the MAKER data_store.
For example, a transcript of length 1077 got a match of length 855 with 100% identity and no gaps. The bit score was 1709 and the E-value was 0. This looks like a pretty good match, but it does not appear in the final MAKER results (GFF/FASTA).
Why is this happening? Are there cutoffs that are not satisfied? If so, what are they and how can they be configured?

Thanks,
Lior

_______________________________________________
maker-devel mailing list
[hidden email]
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

Re: Missing genes in lift-over with est2genome

Carson Holt-2
There are percent cutoffs for the est2genome algorithm that you can set in the maker_bopts.ctl file. Additionally, MAKER will report the alignment but not produce a gene model if it can't translate through the est2genome alignment (i.e. there are stop codons in the assembly). I believe the cutoff is 50%. If you add est_forward=1 to the maker_opts.ctl file, names will be copied from the alignment source, and the score column of the GFF3 will hold the percent match to the original transcript.
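
As a rough illustration of how a percent cutoff could exclude the example from the original question (855 aligned bases out of a 1077-base transcript), here is a minimal Python sketch. The threshold values are assumptions for illustration only: I believe the BLASTN coverage and identity settings in maker_bopts.ctl are named pcov_blastn and pid_blastn and default to 0.8 and 0.85, but please check the file from your own run. How MAKER computes coverage internally may also differ from this simple ratio.

```python
# Assumed thresholds for illustration; verify against your own maker_bopts.ctl.
PCOV_CUTOFF = 0.80  # assumed minimum fraction of the transcript that must align
PID_CUTOFF = 0.85   # assumed minimum fraction identity of the alignment


def coverage(aligned_length: int, transcript_length: int) -> float:
    """Return the fraction of the transcript covered by the alignment."""
    if transcript_length <= 0:
        raise ValueError("transcript_length must be positive")
    return aligned_length / transcript_length


def passes_cutoffs(aligned_length: int, transcript_length: int, identity: float,
                   pcov: float = PCOV_CUTOFF, pid: float = PID_CUTOFF) -> bool:
    """Check a hit against the (assumed) coverage and identity cutoffs."""
    return coverage(aligned_length, transcript_length) >= pcov and identity >= pid


# The example from the question: 855 of 1077 bases aligned at 100% identity.
cov = coverage(855, 1077)
print(f"coverage = {cov:.3f}")
print("passes:", passes_cutoffs(855, 1077, identity=1.0))
```

Under these assumed values the hit covers a little under 80% of the transcript, so a perfect-identity match could still fall just short of a 0.8 coverage threshold.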

—Carson




Re: Missing genes in lift-over with est2genome

Lior Glick
Thanks Carson, your answer was very helpful.
Another question related to the lift-over process, if I may.
I want to take the resulting GFF and pass it on to another MAKER run, in which I provide additional, lower-confidence evidence (ESTs and proteins). I'm not sure which option to use, though. Following this helpful post, I tried pred_gff and model_gff, but both produced fused genes where genes lie very close to one another (see attached picture), even with the correct_est_fusion parameter enabled. It looks like the only way to take the lifted-over genes "as-is" would be to use other_gff, but I gather that this option was not really intended for genes. Would you recommend this usage? Am I missing something?
Thank you!


Attachment: fusion.png (44K)

Re: Missing genes in lift-over with est2genome

Carson Holt-2
If you are using the est_forward=1 option for the lift-over, you can also anchor a search to a specific contig or region by adding a tag to the FASTA header (maker_coor=contig:1-1000;). The tag forces Exonerate to run only on that region. Sometimes that can rescue a model.
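
A minimal Python sketch of adding such a tag to selected FASTA headers follows. The function names, the `regions` mapping, and the choice to append the tag at the end of the header line are assumptions for illustration; only the tag format itself (maker_coor=contig:start-end;) comes from the message above.

```python
def add_maker_coor(header: str, contig: str, start: int, end: int) -> str:
    """Append a maker_coor tag to one FASTA header line (hypothetical helper)."""
    if not header.startswith(">"):
        raise ValueError("not a FASTA header line")
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return f"{header.rstrip()} maker_coor={contig}:{start}-{end};"


def tag_fasta(lines, regions):
    """Yield FASTA lines, tagging headers whose sequence ID is in `regions`.

    `regions` maps a sequence ID to a (contig, start, end) tuple.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            if seq_id in regions:
                line = add_maker_coor(line, *regions[seq_id])
        yield line


# Example: anchor transcript tx1 to the first 1000 bases of a contig named contig7.
example = [">tx1 some description", "ACGT", ">tx2", "GGCC"]
tagged = list(tag_fasta(example, {"tx1": ("contig7", 1, 1000)}))
print("\n".join(tagged))
```

The target coordinates could be taken from your own BLAST hits for the missing transcripts.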

When you pass results into model_gff=, MAKER leaves them unchanged; it just accepts or rejects them as is. But the model itself is considered evidence and can alter clustering. other_gff= just passes things through with no processing or evaluation (it's like cut and paste). You can also try deFusion on the resulting models to resolve gene fusions: https://wjidea.github.io/defusion/Introduction.html
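
To get a list of places worth inspecting for fusions before and after the second run, a small Python sketch like the one below could flag lifted-over genes that sit very close together on the same strand. This is not part of MAKER or deFusion; the function names, the 500 bp gap threshold, and the same-strand restriction are assumptions chosen for illustration.

```python
from collections import defaultdict


def parse_genes(gff_lines):
    """Collect (seqid, start, end, strand, ID) for every 'gene' feature in GFF3 lines."""
    genes = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
        genes.append((cols[0], int(cols[3]), int(cols[4]), cols[6], attrs.get("ID", "")))
    return genes


def close_neighbours(genes, max_gap=500):
    """Return (id1, id2, gap) for consecutive same-strand genes at most max_gap bp apart."""
    by_key = defaultdict(list)
    for seqid, start, end, strand, gene_id in genes:
        by_key[(seqid, strand)].append((start, end, gene_id))
    pairs = []
    for key in sorted(by_key):
        items = sorted(by_key[key])
        for (s1, e1, id1), (s2, e2, id2) in zip(items, items[1:]):
            gap = s2 - e1 - 1  # negative when the two genes overlap
            if gap <= max_gap:
                pairs.append((id1, id2, gap))
    return pairs


# Made-up records for illustration only.
example = [
    "##gff-version 3",
    "chr1\tmaker\tgene\t100\t500\t.\t+\t.\tID=g1",
    "chr1\tmaker\tgene\t600\t900\t.\t+\t.\tID=g2",
    "chr1\tmaker\tgene\t5000\t6000\t.\t+\t.\tID=g3",
    "chr1\tmaker\tgene\t950\t1200\t.\t-\t.\tID=g4",
]
pairs = close_neighbours(parse_genes(example))
print(pairs)
```

Only g1 and g2 are reported here, since g3 is far away and g4 is on the opposite strand.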

—Carson
