advanced repeat masking library constructions & rna-seq assembly choices


Salim Bougouffa
Hi,

I am attempting to annotate a plant genome. I have a couple of questions:

1) RNA-seq assembly
a) I assembled my RNA-seq data using Trinity and StringTie. The two produce drastically different numbers of transcripts. When I compare the two assemblies for each sample using TransRate, StringTie gets a higher score for most of the assemblies. I see in all of the threads that you recommend Trinity, but doesn't Trinity produce way too many transcripts (even after chucking out the "bad" ones using TransRate)?
b) During hint creation, does MAKER take into account that different transcripts have different read coverage (expression levels)? I guess my question is: should I filter out transcripts that have low read coverage?

2) Repeat Masking 
I am following the advanced repeat library construction tutorial (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Advanced). The initial steps find 15 sequences for the LTR and 159 for MITE. But, when I get to the perl DIR_CRL/CRL_Step4.pl step, both output files (Inner_Seq_For_BLAST.fasta, lLTRs_Seq_For_BLAST.fasta) are empty.

a) Are these numbers normal? I was expecting a lot more than 16 for the LTR.
b) I don't get any errors when I run CRL_Step4.pl, yet there is no output. What's going on?

Many thanks,
/SB

_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Re: advanced repeat masking library constructions & rna-seq assembly choices

Carson Holt-2
Michael, can you answer the second question? (Michael wrote the protocol, so I CC’d him.)

With respect to the first question: expression level is not necessarily relevant to the annotation process (so no, MAKER does not look at read coverage). Instead, we use the transcript assemblies to identify introns via splice-aware alignment (yes, it is the introns and not the exons we care about). Trinity has a nice option called jaccard_clip, which avoids false merging of neighboring transcripts (this mostly occurs in fungi, where UTRs can overlap). Merged transcripts cause extra introns to be assigned as hints and can also lead to overextension of UTRs during the final polishing steps. The jaccard_clip option is the main reason we recommend Trinity. If StringTie has a similar option, then it can be used as well.
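
To illustrate the point about introns and false merging, here is a minimal Python sketch. It is not MAKER's or Trinity's code; the coordinates, gene names, and helper functions are hypothetical. It derives introns as the gaps between aligned exons, then shows that when two neighboring transcripts with overlapping UTRs are merged into one, the merged transcript carries the introns of both genes, so a single gene model would receive the neighbor's intron as an extra hint.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end) on the genome


def merge_overlapping(exons: List[Interval]) -> List[Interval]:
    """Fuse overlapping or directly adjacent exon intervals."""
    merged: List[Interval] = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def introns_from_exons(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive exons of one aligned transcript."""
    ordered = merge_overlapping(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


# Hypothetical neighboring genes whose UTRs overlap (positions 440-460).
gene_a = [(100, 200), (300, 460)]
gene_b = [(440, 520), (600, 700)]

introns_a = introns_from_exons(gene_a)
introns_b = introns_from_exons(gene_b)

# A falsely merged transcript spans both genes as if they were one.
introns_merged = introns_from_exons(gene_a + gene_b)

# Introns that a gene model for gene_a would wrongly inherit from the merge.
extra_for_a = [i for i in introns_merged if i not in introns_a]

print("gene_a introns:", introns_a)
print("gene_b introns:", introns_b)
print("merged introns:", introns_merged)
print("extra hints for gene_a:", extra_for_a)
```

Note that read coverage never enters this calculation, which is consistent with the answer above that expression level is not what the hints are built from.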

Thanks,
Carson



On May 4, 2017, at 12:37 AM, Salim Bougouffa <[hidden email]> wrote:


Re: advanced repeat masking library constructions & rna-seq assembly choices

Campbell, Michael
Hi SB,

I’ve added Ning Jiang to this email. She has put great effort into updating this protocol recently and will be able to address your questions better than I can.

Ning, would you mind helping out with this?

Thanks,
Mike

On May 7, 2017, at 9:17 PM, Carson Holt <[hidden email]> wrote:


Re: advanced repeat masking library constructions & rna-seq assembly choices

jiangn

Hi Salim,


I am sorry to learn about the issues. How many intact LTR elements you get depends on the quality of your genome assembly; however, 16 seems too low to me.


The inner and LTR sequence files should NOT be empty. Sometimes the issue is that the original sequence names are long and complicated. If that's the case for your sequences, you might want to simplify the sequence names (using only letters and numbers) and try again.
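
As a rough sketch of the renaming suggested above (an illustration only, not part of the CRL scripts): the Python function below replaces every FASTA header with a short ID made only of letters and numbers, and keeps a lookup table so the original names can be restored later. The function name, the `seq` prefix, and the example headers are assumptions.

```python
from typing import Dict, Iterable, List, Tuple


def simplify_fasta_headers(
    lines: Iterable[str], prefix: str = "seq"
) -> Tuple[List[str], Dict[str, str]]:
    """Rename FASTA records to <prefix><n>, using only letters and digits.

    Returns the rewritten lines and a mapping of new ID -> original header.
    """
    if not prefix.isalnum():
        raise ValueError("prefix must contain only letters and numbers")

    renamed: List[str] = []
    id_to_original: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            new_id = f"{prefix}{len(id_to_original) + 1}"
            id_to_original[new_id] = line[1:]  # keep the full original header
            renamed.append(">" + new_id)
        else:
            renamed.append(line)  # sequence lines are left untouched
    return renamed, id_to_original


# Hypothetical headers of the long, complicated kind.
example = [
    ">scaffold_1|len=500 some description\n",
    "ACGTACGT\n",
    ">contig-2.1\n",
    "GGCCGGCC\n",
]
new_lines, lookup = simplify_fasta_headers(example)
print("\n".join(new_lines))
for new_id, original in lookup.items():
    print(new_id, "<-", original)
```

To apply it to a real file, the lines can be read with `open(path)` and the result written back out; saving the lookup table alongside the renamed genome makes it possible to translate the repeat library back to the original names afterwards.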


We are working on an automatic pipeline for LTR collection; if everything goes smoothly, it should be available in two to three months.


Best wishes,


Ning


From: Campbell, Michael <[hidden email]>
Sent: Sunday, May 7, 2017 9:24 PM
To: Carson Holt
Cc: Salim Bougouffa; [hidden email] List; Jiang, Ning
Subject: Re: [maker-devel] advanced repeat masking library constructions & rna-seq assembly choices
 

Re: advanced repeat masking library constructions & rna-seq assembly choices

Salim Bougouffa

Thank you all for your responses.

Regards,
/SB


On Mon, 8 May 2017, 18:50 Jiang, Ning, <[hidden email]> wrote:
