Advice on my pipeline

Advice on my pipeline

Patrick Tran Van-2

Hello,

This is my first time running MAKER for an insect genome annotation.

I have read various resources and tried to combine them into one pipeline. I would appreciate your thoughts and advice on it: is there anything I could improve, or any step that is unnecessary?


What I have:
- RNA evidence: a transcriptome
- Protein evidence: Swiss-Prot/UniProt plus the BUSCO insect protein set
- CEGMA and BUSCO results for my genome


1) Train SNAP with the CEGMA results.

2) Run MAKER (run A) with repeat masking, using the transcript and protein evidence, the new SNAP file (from step 1) and the Augustus file (from BUSCO).

3) Create a SNAP model from run A.

4) Run MAKER (run B) with the new SNAP model (from step 3), with gene inference directly from the transcript and protein data turned off (est2genome=0 and protein2genome=0). Provide the GFF file (maker_gff=run_A.gff), reuse its repeat annotations instead of masking again (rm_pass=1), and reuse the previous mapping results (altest_pass=1 and protein_pass=1).

5) Create a SNAP model from run B.

6) Run MAKER (run C) with the new SNAP model (from step 5), with gene inference directly from the transcript and protein data turned off (est2genome=0 and protein2genome=0). Provide the GFF file (maker_gff=run_B.gff), reuse its repeat annotations instead of masking again (rm_pass=1), and reuse the previous mapping results (altest_pass=1 and protein_pass=1).

7) Create a SNAP model from run C and an Augustus gene model from run C.

8) Run MAKER (run D) with the new SNAP model and the new Augustus model (both from step 7), with gene inference directly from the transcript and protein data turned off (est2genome=0 and protein2genome=0). Provide the GFF file (maker_gff=run_C.gff), reuse its repeat annotations instead of masking again (rm_pass=1), reuse the previous mapping results (altest_pass=1 and protein_pass=1), and set keep_preds=1.
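
The plan above can be written down as a small table of per-round settings, which makes it easier to check that each round changes only what it is meant to change. This is only a sketch in Python: the option names and the run_X.gff names are the ones used in the steps, while the .hmm file names are hypothetical placeholders, and the Augustus settings are left out because the steps do not name the option used for them.

```python
# Sketch of the four MAKER rounds described in the steps above.
# The .hmm file names are hypothetical placeholders, not real outputs.

def round_settings(snap_hmm, previous_gff=None, keep_preds=False):
    """Return the control-file overrides for one round as a dict of strings."""
    opts = {"snaphmm": snap_hmm}
    if previous_gff is not None:
        # Later rounds: no direct gene inference from evidence, and the
        # previous run is passed in as GFF3 (as written in steps 4, 6 and 8).
        opts.update({
            "est2genome": "0",
            "protein2genome": "0",
            "maker_gff": previous_gff,
            "rm_pass": "1",
            "altest_pass": "1",
            "protein_pass": "1",
        })
    if keep_preds:
        opts["keep_preds"] = "1"
    return opts


PLAN = {
    "A": round_settings("cegma.hmm"),
    "B": round_settings("run_A.hmm", previous_gff="run_A.gff"),
    "C": round_settings("run_B.hmm", previous_gff="run_B.gff"),
    "D": round_settings("run_C.hmm", previous_gff="run_C.gff", keep_preds=True),
}


def as_ctl_lines(opts):
    """Format a settings dict as 'key=value' lines."""
    return [f"{key}={value}" for key, value in opts.items()]


if __name__ == "__main__":
    for run, opts in PLAN.items():
        print(f"# run {run}")
        print("\n".join(as_ctl_lines(opts)))
```

Printing the table side by side shows that runs B, C and D differ only in the SNAP model, the input GFF, and keep_preds in the last round.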



Does this seem coherent?


Cheers,


Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206


_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Re: Advice on my pipeline

Carson Holt-2
Your plan sounds good. A couple of related notes.

Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.

Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can’t be run in the same directory).
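
As a rough illustration of rerunning in the same directory, the sketch below edits option values in the text of a control file between rounds, so that the next run reuses the same directory with only the changed settings. It assumes the `key=value #comment` line layout seen in maker_opts.ctl. The helper name `set_options`, the file name `run_A.hmm` and the comment text in the example are made up for this illustration.

```python
def set_options(ctl_text, updates):
    """Return ctl_text with the given option values replaced.

    Lines are expected to look like 'key=value #comment'. Comments are
    kept, and an option that is missing from the file raises KeyError
    instead of being appended silently.
    """
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        stripped = line.lstrip()
        if stripped and not stripped.startswith("#") and "=" in stripped:
            key = stripped.split("=", 1)[0].strip()
            if key in remaining:
                _, _, comment = line.partition("#")
                new_line = f"{key}={remaining.pop(key)}"
                if comment:
                    new_line += f" #{comment}"
                line = new_line
        out.append(line)
    if remaining:
        raise KeyError(f"options not found: {sorted(remaining)}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical two-line excerpt; the comments are placeholders.
    example = "snaphmm= #SNAP HMM file\nest2genome=1 #placeholder comment\n"
    print(set_options(example, {"snaphmm": "run_A.hmm", "est2genome": "0"}))
```

Raising on an unknown option name is a deliberate choice here: a misspelled option is reported immediately rather than written into the file.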

—Carson


On Jun 2, 2017, at 3:56 AM, Patrick Tran Van <[hidden email]> wrote:


Re: Advice on my pipeline

Patrick Tran Van-2

Thanks for your answer.


1) Do you think that also training Augustus at steps 3 and 5, in addition to SNAP, would give more reliable results than adding Augustus only in the final round?

I ask because I am using autoAug for the training, and it takes a long time to run.


2) I cannot find the option 'avoid_est_fusion=1'. I tried adding it, but I got this message:


WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl


(I am using v2.31.8.)



Patrick Tran Van



From: Carson Holt <[hidden email]>
Sent: Monday, June 5, 2017 8:29 PM
To: Patrick Tran Van
Cc: [hidden email]
Subject: Re: [maker-devel] Advice on my pipeline
 

Re: Advice on my pipeline

Carson Holt-2
Sorry, the option is correct_est_fusion.

It is in the maker_opts.ctl file.
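
A mistyped option name like this can be caught before a run by comparing it with the names that are actually present in the control file. The sketch below does that with Python's standard difflib module. The control-file text is a shortened, hypothetical excerpt, and the helper names are made up for this example.

```python
import difflib


def option_names(ctl_text):
    """Collect the option names defined in a control file's text."""
    names = []
    for line in ctl_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "=" in line:
            names.append(line.split("=", 1)[0].strip())
    return names


def suggest(option, ctl_text):
    """Return option names from the file that look similar to `option`."""
    return difflib.get_close_matches(option, option_names(ctl_text))


# Shortened, hypothetical excerpt of a maker_opts.ctl file.
CTL = """\
est2genome=0
protein2genome=0
keep_preds=0
correct_est_fusion=0
"""

if __name__ == "__main__":
    print(suggest("avoid_est_fusion", CTL))
```

With this excerpt, the closest name to 'avoid_est_fusion' is 'correct_est_fusion', because the two share the long '_est_fusion' suffix.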

I would use both SNAP and Augustus on a few large contigs, then review the results manually. If one of them is not behaving well, drop it. If both behave well (i.e. correlate well with the evidence alignments), keep them both.
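
One way to get "a few large contigs" for such a test is to pull the longest records out of the assembly FASTA into a small test genome. The following is a minimal sketch using only the Python standard library. It works on an iterable of lines, so it makes no assumption about file names, and the function names are made up for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_contigs(lines, n):
    """Return the n longest records, longest first."""
    records = list(read_fasta(lines))
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)[:n]


def write_fasta(records, width=60):
    """Format records as FASTA text with sequences wrapped at `width`."""
    out = []
    for header, seq in records:
        out.append(f">{header}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    toy = [">c1", "ACGT", ">c2", "ACGTACGT", "AC", ">c3", "A"]
    print(write_fasta(longest_contigs(toy, 2)))
```

For a real assembly, the lines would come from an open file handle, and the output would be written to a new FASTA file used as the genome for the test run.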

—Carson



On Jun 26, 2017, at 3:48 AM, Patrick Tran Van <[hidden email]> wrote:


Re: Advice on my pipeline

Patrick Tran Van-2
I have assembled my transcriptome with Trinity using the jaccard_clip option, and I have run MAKER both with and without correct_est_fusion.


I then used maker2zff to train SNAP and filter the genes:


maker2zff  specie.all.gff


Here are my results:


Number of genes after MAKER -> number of genes after maker2zff


- Without correct_est_fusion: 21621 -> 13875

- With correct_est_fusion: 16850 -> 9098
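

For reference, the share of gene models that maker2zff kept in each case can be worked out from the four counts above. This is plain arithmetic on the reported numbers and says nothing by itself about which annotation is better.

```python
# Counts reported above: (genes after MAKER, genes after maker2zff).
counts = {
    "without correct_est_fusion": (21621, 13875),
    "with correct_est_fusion": (16850, 9098),
}


def retained_fraction(before, after):
    """Fraction of gene models still present after maker2zff."""
    return after / before


if __name__ == "__main__":
    for label, (before, after) in counts.items():
        share = retained_fraction(before, after)
        print(f"{label}: {after}/{before} = {share:.1%}")
```

This gives about 64.2% without correct_est_fusion and about 54.0% with it.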


1) If I understand correctly how correct_est_fusion works, it prevents gene merging, so shouldn't the result be the opposite? I would expect to find more genes with correct_est_fusion, not fewer.


2) I expect roughly 13000-14000 genes for my species. Should I use the run without correct_est_fusion for the second iteration of MAKER?


Thanks for your help.




Patrick Tran Van



From: Carson Holt <[hidden email]>
Sent: Monday, June 26, 2017 11:38 PM
To: Patrick Tran Van
Cc: [hidden email]
Subject: Re: [maker-devel] Advice on my pipeline
 

Re: Advice on my pipeline

Carson Holt-2
maker2zff is only for SNAP training, not for gene filtering. Please do not use it for filtering; it does not do what you think.

So the final annotation set after MAKER with correct_est_fusion is 16,850 genes. To decide which set is better, look at them in a genome browser (gene counts are not useful for gauging the result). A well-annotated genome will have evidence clusters that closely match the final models. A poorly annotated genome will have evidence clusters that are split or merged by the models.
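
Before opening a browser, the split/merged idea can be roughly approximated by counting overlaps between model intervals and evidence-cluster intervals. The sketch below is a simplified stand-in and not how MAKER evaluates anything: it assumes a single sequence, ignores strand and exon structure, and uses made-up names and coordinates.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_models(models, clusters):
    """Flag merged models, split clusters and unsupported models.

    models and clusters map a name to a (start, end) interval.
    """
    report = {"merged": [], "split": [], "unsupported": []}
    for name, span in models.items():
        hits = [c for c, cspan in clusters.items() if overlaps(span, cspan)]
        if len(hits) > 1:
            report["merged"].append(name)       # one model spans several clusters
        elif not hits:
            report["unsupported"].append(name)  # no evidence under this model
    for cname, cspan in clusters.items():
        hits = [m for m, span in models.items() if overlaps(span, cspan)]
        if len(hits) > 1:
            report["split"].append(cname)       # one cluster covered by several models
    return report


if __name__ == "__main__":
    # Made-up coordinates for illustration only.
    models = {"g1": (100, 900), "g2": (1500, 1800),
              "g3": (1900, 2400), "g4": (5000, 5500)}
    clusters = {"e1": (120, 400), "e2": (600, 880), "e3": (1450, 2450)}
    print(classify_models(models, clusters))
```

Regions flagged this way are only candidates for manual review; the browser view remains the real check.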

The correct_est_fusion option does two things. It trims long overlapping UTR fragments, and it stops evidence clusters from being merged on BLASTP evidence alone (so gene predictors will get unmerged hint regions if clusters are split).

You may also find that using jaccard_clip with Trinity has reduced sensitivity for the transcript data: you may lose things that were there before, but you now have better specificity, i.e. fewer false positives. Make sure you provide protein data from at least two related species to help recover the sensitivity lost from the transcript data. You can also add rejected gene models back in afterwards by using iprscan to identify unsupported models with identifiable protein domains: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286374/

Thanks,
Carson




On Jul 1, 2017, at 5:21 AM, Patrick Tran Van <[hidden email]> wrote:


Re: Advice on my pipeline

Patrick Tran Van-2

Hi Carson,


I have a question about round 2. In a previous reply you said:


" Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can’t be run in the same directory). "

 

Does this mean that I don't need to modify the following section?


#-----Re-annotation Using MAKER Derived GFF3


If I leave everything at the defaults, such as:


altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no



Will MAKER then skip recomputing the repeat masking and the protein and transcriptome alignments?


Patrick Tran Van



From: Carson Holt <[hidden email]>
Sent: Monday, July 3, 2017 10:50 PM
To: Patrick Tran Van
Cc: [hidden email]
Subject: Re: [maker-devel] Advice on my pipeline
 
maker2zff is just for SNAP training and not for gene filtering (please do not use it for filtering, it does not do what you think).

So the final annotation set after maker with correct_est_fusion is 16,850. To decide which set is better, look at them in a browser (gene counts are not useful for guaging result). A well annotated genome will have evidence clusters that closely match the final models. A poorly annoted genome will have evidence clusters that are split or merged by the models.

The corrected_est_fusion does two things. It trims long overlapping UTR fragments, and it stops evidence clusters from being merged on BLASTP evidence alone (so gene predictors will get unmerged hint regions if clusters are split).

You may also find that using jaccard_clip with Trinity has reduced sensitivity for the transcript data (you may lose things that were there before, but now have better specificity, i.e. fewer false positives). Make sure you provided protein data from at least two related species to help maintain sensitivity lost form the transcript data. You can also add rejected genes models back in after the fact by using iprscan to identify unsupported models with identifiable protein domains —> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286374/

Thanks,
Carson




On Jul 1, 2017, at 5:21 AM, Patrick Tran Van <[hidden email]> wrote:

So I have assembled my transcriptome with Trinity using the jaccard clip option and I have run maker with and without corrected_est_fusion.

I have then use SNAP to train/filter it with:

maker2zff  specie.all.gff

Here are my results:

Number of gene after maker -> Number of gene after maker2zff

- Without corrected_est_fusion: 21621 -> 13875
- With corrected_est_fusion: 16850 -> 9098

1 )If I understand well how works corrected_est_fusion, because it prevents gene merging, shouldn't be the invert ?
Normally I should find more genes with corrected_est_fusion right ?

2) I think I should find something like 13000-14000 genes for my specie. SHould I go with the "Without corrected_est_fusion" for the 2nd iteration of maker ?

 Thanks for your help 



Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206


From: Carson Holt <[hidden email]>
Sent: Monday, June 26, 2017 11:38 PM
To: Patrick Tran Van
Cc: [hidden email]
Subject: Re: [maker-devel] Advice on my pipeline
 
Sorry the option is —> correct_est_fusion

It is in the maker_opts.ctl file.

I would use both SNAP and Augustus on a few large contigs then review the results manually. If one of them is not behaving well, then drop it. If both behave well (i.e. correlate well with evidence alignemnts) then keep them both.

—Carson



On Jun 26, 2017, at 3:48 AM, Patrick Tran Van <[hidden email]> wrote:

Thanks for your answer.

1) Do you think that adding a Augustus training in addition to SNAP at the step 3 and 5 will add more confidence (instead of adding Augustus only for the final round) ?
Because I am using autoAug for this and it tooks a while to compute ..

2) I don't see this option : 'avoid_est_fusion=1' . I have tried to add it but I got this error:

WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl

(I am using v 2.31.8 )


Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206


From: Carson Holt <[hidden email]>
Sent: Monday, June 5, 2017 8:29 PM
To: Patrick Tran Van
Cc: [hidden email]
Subject: Re: [maker-devel] Advice on my pipeline
 
Your plan sounds good. A couple of related notes.

Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.

Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can’t be run in the same directory).

—Carson



Re: Advice on my pipeline

Carson Holt-2
The GFF3 passthrough options are there to help users get old data into MAKER when they have lost access to the original files. For iterative runs of the pipeline, however, it is more effective to rerun in place so that MAKER can access the raw alignment reports. These reports hold more detail than what is stored in the GFF3, and that detail is lost when the GFF3 is used as input.
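
As a rough illustration of the point above, the following Python sketch checks whether a control file still relies on the GFF3 passthrough options. It assumes the `key=value #comment` layout used in maker_opts.ctl. The helper names `parse_ctl` and `passthrough_in_use` and the sample text are hypothetical and not part of MAKER.

```python
def parse_ctl(text):
    """Parse control-file lines of the form 'key=value #comment' into a dict."""
    opts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        opts[key.strip()] = value.strip()
    return opts


# Options named in this thread for feeding a previous run back in as GFF3.
PASSTHROUGH = ("maker_gff", "altest_pass", "protein_pass", "rm_pass")


def passthrough_in_use(opts):
    """Return the passthrough options that are set to something other than empty or 0."""
    return [key for key in PASSTHROUGH if opts.get(key, "") not in ("", "0")]


# Hypothetical control-file excerpt (comments shortened for illustration).
example = """\
#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #left empty when rerunning in the same directory
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
"""

print(passthrough_in_use(parse_ctl(example)))
```

An empty list means none of the passthrough options are active, which is the situation described above for a rerun in the same directory.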

—Carson


On Sep 21, 2017, at 3:26 AM, Patrick Tran Van <[hidden email]> wrote:

Hi Carson,

I have a question about round 2. In a previous reply you said:

" Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can’t be run in the same directory). "

 

Does this mean that I don't need to modify the section:

#-----Re-annotation Using MAKER Derived GFF3

?

If I leave everything at the defaults, such as:

altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no


will MAKER still avoid searching again for repeats and recomputing the protein and transcriptome alignments?

Patrick Tran Van



From: Carson Holt <[hidden email]>
Sent: Monday, July 3, 2017 10:50 PM
To: Patrick Tran Van
Cc: [hidden email]
Subject: Re: [maker-devel] Advice on my pipeline
 
maker2zff is just for SNAP training and not for gene filtering (please do not use it for filtering, it does not do what you think).

So the final annotation set after MAKER with correct_est_fusion is 16,850. To decide which set is better, look at them in a genome browser (gene counts are not useful for gauging the result). A well-annotated genome will have evidence clusters that closely match the final models. A poorly annotated genome will have evidence clusters that are split or merged by the models.

The correct_est_fusion option does two things. It trims long overlapping UTR fragments, and it stops evidence clusters from being merged on BLASTP evidence alone (so gene predictors will get unmerged hint regions if clusters are split).

You may also find that using jaccard_clip with Trinity has reduced sensitivity for the transcript data (you may lose things that were there before, but you now have better specificity, i.e. fewer false positives). Make sure you provide protein data from at least two related species to help recover the sensitivity lost from the transcript data. You can also add rejected gene models back in after the fact by using iprscan to identify unsupported models with identifiable protein domains: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286374/

Thanks,
Carson




On Jul 1, 2017, at 5:21 AM, Patrick Tran Van <[hidden email]> wrote:

So I have assembled my transcriptome with Trinity using the jaccard_clip option, and I have run MAKER with and without correct_est_fusion.

I then ran the SNAP training/filtering step with:

maker2zff  specie.all.gff

Here are my results:

Number of genes after MAKER -> number of genes after maker2zff

- Without correct_est_fusion: 21621 -> 13875
- With correct_est_fusion: 16850 -> 9098
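
For reference, gene counts like the ones above can be reproduced with a short script. This is only a sketch: it assumes the final models are the GFF3 rows whose source column is `maker` and whose type column is `gene`. The function name `count_maker_genes` and the sample rows are made up for illustration.

```python
def count_maker_genes(lines, source="maker"):
    """Count GFF3 rows whose source column is `source` and whose type column is 'gene'."""
    count = 0
    for line in lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section follows; no more feature rows
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3 and cols[1] == source and cols[2] == "gene":
            count += 1
    return count


# Made-up GFF3 rows: two final gene models plus one evidence row.
sample = [
    "##gff-version 3",
    "contig1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1",
    "contig1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-mRNA-1;Parent=g1",
    "contig1\tsnap_masked\tmatch\t120\t880\t.\t+\t.\tID=m1",
    "contig2\tmaker\tgene\t50\t400\t.\t-\t.\tID=g2",
    "##FASTA",
    ">contig1",
]

print(count_maker_genes(sample))
```

With a real file, the same function can be given an open file handle, for example `with open("specie.all.gff") as fh: count_maker_genes(fh)`.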

1) If I understand correctly how correct_est_fusion works, it prevents gene merging, so shouldn't the result be the opposite?
Shouldn't I normally find more genes with correct_est_fusion?

2) I think I should find something like 13000-14000 genes for my species. Should I go with the "without correct_est_fusion" set for the second iteration of MAKER?

Thanks for your help.



Patrick Tran Van



From: Carson Holt <[hidden email]>
Sent: Monday, June 26, 2017 11:38 PM
To: Patrick Tran Van
Cc: [hidden email]
Subject: Re: [maker-devel] Advice on my pipeline
 
Sorry, the option is correct_est_fusion.

It is in the maker_opts.ctl file.

I would run both SNAP and Augustus on a few large contigs and then review the results manually. If one of them is not behaving well, drop it. If both behave well (i.e. correlate well with the evidence alignments), keep them both.
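
One possible way to pick "a few large contigs" for such a test is sketched below in Python. It is a minimal sketch that assumes the assembly is a plain FASTA file. The function names `read_fasta` and `largest_contigs` and the toy records are hypothetical.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the contig name.
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def largest_contigs(lines, n=3):
    """Return the n longest (name, sequence) records, longest first."""
    records = list(read_fasta(lines))
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)[:n]


# Toy input; a real run would pass an open file handle for the genome FASTA.
demo = ">c1 first contig\nACGT\nAC\n>c2\nA\n>c3\nACGTACGTAC\n".splitlines()

for name, seq in largest_contigs(demo, n=2):
    print(name, len(seq))
```

The selected records can then be written to a small FASTA file and used as the genome input for a quick test run before committing to the full assembly.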

—Carson



On Jun 26, 2017, at 3:48 AM, Patrick Tran Van <[hidden email]> wrote:

Thanks for your answer.

1) Do you think that adding Augustus training in addition to SNAP at steps 3 and 5 would add more confidence (instead of adding Augustus only for the final round)?
I ask because I am using autoAug for this and it takes a while to compute.

2) I don't see the option 'avoid_est_fusion=1'. I tried to add it, but I got this error:

WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl

(I am using v2.31.8.)


Patrick Tran Van



From: Carson Holt <[hidden email]>
Sent: Monday, June 5, 2017 8:29 PM
To: Patrick Tran Van
Cc: [hidden email]
Subject: Re: [maker-devel] Advice on my pipeline
 
Your plan sounds good. A couple of related notes.

Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.

Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can’t be run in the same directory).

—Carson

