Fewer gene models output with a superset of EST evidence

Fewer gene models output with a superset of EST evidence

Bob Zimmermann
Hi Maker Developers,

I have been experimenting with several data sets as input to annotate our newly reassembled genome. We have three RNA-seq datasets, which have been assembled into de novo transcripts using Trinity. These are input into the MAKER pipeline along with protein evidence. What is strange is that when I run MAKER with the de novo transcripts from a single set, I obtain more MAKER transcripts than when I run with the combined set (1619 vs 1450 on one chromosome), and they are longer (median transcript length 1619 vs 1450, IQR 872-2160 vs 667-2026). It would make sense for the single-set transcripts to be more numerous and shorter if the additional evidence were joining transcripts, but the numbers indicate that this is not the case.
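For anyone wanting to reproduce this kind of comparison, the following is a minimal sketch of how the transcript count, median length and IQR could be computed from a GFF3 file. It assumes that "transcript length" means the summed exon length per mRNA and that the final gene models carry `maker` in the source column; both are assumptions, not details stated in this thread. The function names and the `source` parameter are hypothetical.

```python
import statistics
from collections import defaultdict


def transcript_lengths(gff3_lines, source="maker"):
    """Return {transcript ID: spliced length} by summing exon lengths per Parent.

    Only exon rows whose source column equals `source` are counted
    (assumption: final models are labelled "maker"). Comment lines,
    blank lines and anything without 9 tab-separated columns, such as
    an embedded FASTA section, are skipped.
    """
    lengths = defaultdict(int)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[1] != source or cols[2] != "exon":
            continue
        start, end = int(cols[3]), int(cols[4])
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        # An exon may be shared by several transcripts (comma-separated Parent).
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                lengths[parent] += end - start + 1  # GFF3 coordinates are 1-based, inclusive
    return dict(lengths)


def summarize(lengths):
    """Return count, median and quartiles of the transcript lengths."""
    values = sorted(lengths.values())
    q1, median, q3 = statistics.quantiles(values, n=4)  # needs at least 2 values
    return {"n": len(values), "median": median, "q1": q1, "q3": q3}
```

Running `summarize(transcript_lengths(open(path)))` on the GFF3 from each run, restricted to the same chromosome, would give directly comparable numbers.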

Therefore I’m trying to understand the algorithm. From what I understand, if it finds evidence for an ab initio prediction for which the internal splice junctions agree, then the prediction is considered for improvement. Why, then, if my combined set is a strict superset of the single set, do I get more transcripts with the single set?

Thanks for your help!

Best,
Bob



Department of Molecular Evolution and Development
Universität Wien
Althanstraße 14 (UZA I), Zimmer 2.019
1090 Vienna
Austria

+43 1 427757002


_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Re: Fewer gene models output with a superset of EST evidence

Bob Zimmermann
Correction to the above numbers: the median lengths are 1414 and 1256.

> On 19 Oct 2017, at 17:25, Bob Zimmermann <[hidden email]> wrote:

Re: Fewer gene models output with a superset of EST evidence

Carson Holt-2
You should look at both in a browser to get a better idea of what’s going on. What MAKER does is take the evidence given, cluster it (strand-specific clustering), then use the transcript evidence as intron hints to the predictors and the protein alignments as exon hints (it will also use polished protein hints to generate intron hints in the absence of transcript intron hints). Finally, it uses overlapping transcript evidence to generate UTRs. So look at it in a browser. See if the apparent overlap clusters are different in extent, and also look for mRNA-seq evidence being merged. If a cluster is falsely merging two loci because the mRNA-seq is merged, one of two things will happen: you will get multiple models, since the predictor can’t make a single model work within the cluster using the hints, or you will get a model with a really long UTR that is blocking other models from existing in the region. Also, depending on the mRNA-seq evidence coming in, you may be generating false models because of noise in the data. Essentially everything is transcribed at a basal level, so as you get more and more mRNA-seq, you generate more and more spurious alignments. So more evidence might generate fewer, longer alignments, either for true loci or by falsely merging genes, while simultaneously adding a number of very short spurious results.
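To make the clustering point concrete, here is a toy sketch of strand-specific overlap clustering. It is an illustration only, not MAKER's actual implementation: the function name, the `(start, end, strand)` tuple layout and the coordinates are all hypothetical. It shows how a single additional alignment that bridges two loci can collapse two clusters into one, so a superset of evidence can yield fewer clusters.

```python
from collections import defaultdict


def cluster_alignments(alignments):
    """Merge overlapping (start, end, strand) spans into clusters, per strand.

    Spans on different strands are never merged. Returns a list of
    (cluster_start, cluster_end, strand) tuples, "+" strand first.
    """
    by_strand = defaultdict(list)
    for start, end, strand in alignments:
        by_strand[strand].append((start, end))

    clusters = []
    for strand, spans in sorted(by_strand.items()):
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlaps the current cluster: extend it.
                cur_end = max(cur_end, end)
            else:
                clusters.append((cur_start, cur_end, strand))
                cur_start, cur_end = start, end
        clusters.append((cur_start, cur_end, strand))
    return clusters


# Hypothetical evidence from a single RNA-seq assembly: two loci on "+", one on "-".
single = [(100, 900, "+"), (1500, 2400, "+"), (300, 800, "-")]

# The combined set is a strict superset; the extra transcript bridges the two "+" loci.
combined = single + [(850, 1600, "+")]

print(len(cluster_alignments(single)), "clusters with the single set")
print(len(cluster_alignments(combined)), "clusters with the combined set")
```

In this toy case the two "+" loci end up in one cluster spanning 100 to 2400 once the bridging transcript is added, which is the kind of difference in cluster extent worth looking for in the browser.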

—Carson



> On Oct 19, 2017, at 9:28 AM, Bob Zimmermann <[hidden email]> wrote: