scaffolds missing from master_datastore_index.log and all.gff files

Valerie Soza
Hi MAKER community

I have done several rounds of training and annotations on an assembly that consists of 12027 scaffolds using the MAKER2 pipeline. I am running multiple instances of MAKER to speed up the process. I have noticed that the number of contigs (aka scaffolds) differs among the different rounds of annotations I have done in MAKER, ranging from 12024 to 12027 scaffolds. These counts were obtained with the SOBAcl tool to count the number of contigs from each all.gff file generated by the gff3_merge script included in MAKER.

In my latest round of annotations within MAKER, I have only obtained 12026 scaffolds using the SOBAcl tool on the all.gff file, indicating that I am missing 1 scaffold, even though there was no indication of any scaffolds as FAILED, RETRY, or SKIPPED.

To figure out what might be going on, I searched for STARTED and FINISHED scaffolds in the master_datastore_index.log and found that I had a different number of started vs. finished scaffolds, and none of these were equal to the total in the assembly of 12027.

$ grep STARTED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc
12024   12024  313247

3 started scaffolds missing from this file are LG01_ordered_scaffold_2, LG01_ordered_scaffold_3, and LG07_unordered_scaffold_86.

$ grep FINISHED Rwill7_master_datastore_index.log | sort | uniq | cut -f 1 | wc
12026   12026  313295

1 finished scaffold missing from this file is LG08_unordered_scaffold_90.
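
For anyone wanting to reproduce this kind of comparison, here is a minimal sketch that lists the sequence IDs in the assembly FASTA that have no entry with a given status in the index log. It assumes the log is tab-separated with the sequence ID in the first column and the status (STARTED, FINISHED, ...) in the last column; the function name missing_status and the file names in the usage line are hypothetical.

```shell
# List FASTA sequence IDs that have no line with the given status in a
# MAKER master datastore index log.
# Assumption: the log is tab-separated, ID in column 1, status in the last column.
# Usage: missing_status <assembly.fasta> <index.log> <STATUS>
missing_status() {
    awk -F'\t' -v logfile="$2" -v status="$3" '
        # Lines from the index log: remember every ID carrying the status.
        FILENAME == logfile {
            if ($NF == status) seen[$1] = 1
            next
        }
        # Lines from the FASTA: take the ID up to the first whitespace
        # and print it if the log never showed it with that status.
        /^>/ {
            id = substr($0, 2)
            sub(/[ \t].*/, "", id)
            if (!(id in seen)) print id
        }
    ' "$2" "$1"
}
```

Something like `missing_status assembly.fasta Rwill7_master_datastore_index.log FINISHED` would then print one missing ID per line, in FASTA order. Because it compares IDs rather than whole log lines, duplicate or differing path entries for the same scaffold do not affect the result.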

I then searched for these scaffolds in the all.gff file and found that the 3 missing started scaffolds were present, but the one missing finished scaffold (LG08_unordered_scaffold_90) was not. This scaffold (LG08_unordered_scaffold_90) should be in the gff3 as it had some repeat masking done on it as indicated by the query.masked.fasta file for this scaffold in its theVoid directory.

After looking at the gff3_merge and fasta_merge scripts, it seems that only finished scaffolds are used to generate gff3 and fasta files so this explains why I am missing one scaffold (LG08_unordered_scaffold_90) for a total of only 12026 scaffolds in the all.gff file.
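
A similar check can be made directly against the merged GFF3, without SOBAcl. The sketch below is only an illustration: it assumes every annotated scaffold appears in column 1 of at least one nine-column feature line, and it stops reading features at the ##FASTA directive in case sequences are embedded at the end of the file. The function name gff_missing_seqids is hypothetical.

```shell
# List FASTA sequence IDs that never occur as a seqid (column 1) in a GFF3 file.
# Assumption: each annotated scaffold has at least one feature line.
# Usage: gff_missing_seqids <assembly.fasta> <all.gff>
gff_missing_seqids() {
    awk -F'\t' -v gff="$2" '
        # Lines from the GFF3: collect seqids until the ##FASTA section starts.
        FILENAME == gff {
            if ($0 ~ /^##FASTA/) in_fasta = 1
            if (!in_fasta && $0 !~ /^#/ && NF >= 9) seen[$1] = 1
            next
        }
        # Lines from the assembly FASTA: report IDs with no feature line.
        /^>/ {
            id = substr($0, 2)
            sub(/[ \t].*/, "", id)
            if (!(id in seen)) print id
        }
    ' "$2" "$1"
}
```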

I am concerned that, because the started and finished scaffolds differ in the master_datastore_index.log, not all scaffolds are being output to the gff3 and fasta files generated by the MAKER scripts.

Any insights as to why I am getting different numbers of scaffolds indicated as started versus finished, and why all but 1 scaffold finished?

Thanks.

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/


_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Re: scaffolds missing from master_datastore_index.log and all.gff files

Carson Holt-2
If you run multiple MAKER jobs at the same time, you can hit a race condition where one MAKER run keeps another from making the correct entry in the datastore log.

You can delete the log and then run ‘maker -dsindex’ to rebuild it with a single maker process (takes less than 5 minutes).

—Carson
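
The rebuild described above might be wrapped up roughly as follows. This is a sketch under assumptions, not an official MAKER recipe: the function name rebuild_index and the MAKER override variable are hypothetical, the log is renamed to a .bak copy instead of being deleted outright, and any extra arguments (for example the -base value used for the original run) are simply passed through to maker. It should be run from the directory holding the control files, once no other MAKER instance is writing to the same output directory.

```shell
# Move the old index log aside and let a single maker process rebuild it.
# Usage: rebuild_index <path/to/base_master_datastore_index.log> [extra maker args...]
rebuild_index() {
    index_log=$1
    shift
    # Keep a backup copy instead of deleting the log (a choice made for this sketch).
    if [ -f "$index_log" ]; then
        mv "$index_log" "$index_log.bak"
    fi
    # -dsindex is the option named in the reply above; MAKER can be set to
    # point at a specific maker executable (hypothetical convenience).
    ${MAKER:-maker} -dsindex "$@"
}
```

Once the log has been regenerated, gff3_merge and fasta_merge can be re-run with -d pointing at the new log to regenerate the merged files.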



> On Mar 14, 2018, at 6:21 PM, Valerie Soza <[hidden email]> wrote:
> [...]


Re: scaffolds missing from master_datastore_index.log and all.gff files

Valerie Soza
Thanks, Carson, that worked. I am now getting all scaffolds in the assembly indicated as finished in the master_datastore_index.log and when I redo gff3_merge with this updated log, my all.gff file is complete too. Glad there is a quick fix for this in MAKER. Thank you, MAKER developers.

-Valerie

> On Mar 15, 2018, at 8:26 AM, Carson Holt <[hidden email]> wrote:
> [...]
