short scaffolds finish, long scaffolds (almost always) fail

short scaffolds finish, long scaffolds (almost always) fail

Devon O'Rourke
Hello,

I apologize for not posting directly to the archived forum, but it appears that the option to enter new posts is disabled. Perhaps that is by design, so that emails go directly to this address; I hope this is the right place to send this.

Thank you for your continued support of Maker and your responses to the forum posts. I have been running Maker (v3.01.02-beta) to annotate a mammalian genome that consists of 22 chromosome-length scaffolds (roughly 20-200 Mb each) and about 10,000 smaller fragments from 10 kb to 1 Mb in length. In my various test runs, the vast majority of the smaller fragments are annotated successfully, but nearly all of the large scaffolds fail with the same error in the 'run.log.child.0' file:
```
DIED RANK 0:6:0:0
DIED COUNT 2
```
(the master 'run.log' file just shows "DIED COUNT 2")
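A quick way to see how widespread this failure is would be to grep the per-contig logs under the MAKER datastore for DIED entries; the path below follows the usual <base>.maker.output layout and is only illustrative:
```
# list every contig directory whose run.log records a DIED entry
grep -rl "DIED" <base>.maker.output/<base>_datastore/ --include="run.log*" | sort
```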

I struggled to find this exact error anywhere on the forum and was hoping you might be able to help me figure out where to start troubleshooting. I thought it might be a memory issue, so I increased the chunk size from the default to a few larger sequence lengths (I've tried 1e6, 1e7, and 999,999,999; all produce the same outcome). I've tried running the program with parallel support using either OpenMPI or MPICH, and I've also tried running on a single node with 24 CPUs and 120 GB of RAM. It always stalls at the same step.
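For context, the chunk size being adjusted here presumably corresponds to max_dna_len in maker_opts.ctl; a minimal sketch of that setting, assuming that parameter name (verify against your own control file):
```
# maker_opts.ctl (excerpt)
# max_dna_len sets how large a piece of a contig MAKER analyzes at once;
# larger chunks mean fewer splits but more RAM per chunk.
max_dna_len=1000000   # values of 1e6, 1e7, and 999999999 were tried above
```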

Interestingly, one of the 22 large scaffolds always finishes and produces the .maker.proteins.fasta, .maker.transcripts.fasta, and .gff files, while the other 21 fail. This makes me think it may not be a memory issue.

For both the completed and the failed scaffolds, the "theVoid.scaffoldX" subdirectories are all present and contain the .rb.cat.gz, .rb.out, .specific.ori.out, .specific.cat.gz, .specific.out, te_proteins*fasta.repeatrunner, est *fasta.blastn, altest *fasta.tblastx, and protein *fasta.blastx files (all of which appear finished, as far as I can tell).
However, the contents of the parent directory of each "theVoid.scaffoldX" folder differ. For the failed scaffolds, the contents generally look something like this (that is, they stall with the same set of files produced):
```
0
evidence_0.gff
query.fasta
query.masked.fasta
query.masked.fasta.index
query.masked.gff
run.log.child.0
scaffold22.0.final.section
scaffold22.0.pred.raw.section
scaffold22.0.raw.section
scaffold22.gff.ann
scaffold22.gff.def
scaffold22.gff.seq
```

For the completed scaffold, there are many more files created:
```
0
10
100
20
30
40
50
60
70
80
90
evidence_0.gff
evidence_10.gff
evidence_1.gff
evidence_2.gff
evidence_3.gff
evidence_4.gff
evidence_5.gff
evidence_6.gff
evidence_7.gff
evidence_8.gff
evidence_9.gff
query.fasta
query.masked.fasta
query.masked.fasta.index
query.masked.gff
run.log.child.0
run.log.child.1
run.log.child.10
run.log.child.2
run.log.child.3
run.log.child.4
run.log.child.5
run.log.child.6
run.log.child.7
run.log.child.8
run.log.child.9
scaffold4.0-1.raw.section
scaffold4.0.final.section
scaffold4.0.pred.raw.section
scaffold4.0.raw.section
scaffold4.10.final.section
scaffold4.10.pred.raw.section
scaffold4.10.raw.section
scaffold4.1-2.raw.section
scaffold4.1.final.section
scaffold4.1.pred.raw.section
scaffold4.1.raw.section
scaffold4.2-3.raw.section
scaffold4.2.final.section
scaffold4.2.pred.raw.section
scaffold4.2.raw.section
scaffold4.3-4.raw.section
scaffold4.3.final.section
scaffold4.3.pred.raw.section
scaffold4.3.raw.section
scaffold4.4-5.raw.section
scaffold4.4.final.section
scaffold4.4.pred.raw.section
scaffold4.4.raw.section
scaffold4.5-6.raw.section
scaffold4.5.final.section
scaffold4.5.pred.raw.section
scaffold4.5.raw.section
scaffold4.6-7.raw.section
scaffold4.6.final.section
scaffold4.6.pred.raw.section
scaffold4.6.raw.section
scaffold4.7-8.raw.section
scaffold4.7.final.section
scaffold4.7.pred.raw.section
scaffold4.7.raw.section
scaffold4.8-9.raw.section
scaffold4.8.final.section
scaffold4.8.pred.raw.section
scaffold4.8.raw.section
scaffold4.9-10.raw.section
scaffold4.9.final.section
scaffold4.9.pred.raw.section
scaffold4.9.raw.section
```

Thanks for any troubleshooting tips you can offer.

Cheers,
Devon

--
Devon O'Rourke
Postdoctoral researcher, Northern Arizona University
Lab of Jeffrey T. Foster - https://fozlab.weebly.com/
twitter: @thesciencedork

_______________________________________________
maker-devel mailing list
[hidden email]
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
Re: short scaffolds finish, long scaffolds (almost always) fail

Carson Holt-2
If running under MPI, the reason for a failure may be further back in the STDERR (failures tend to snowball into other failures, so the initial cause is often much earlier in the output). If you can capture the STDERR and send it, that would be the most informative. If it's memory, you can also set all the blast depth parameters in maker_bopts.ctl to a value like 20.
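A minimal sketch of that maker_bopts.ctl edit, assuming the standard MAKER 3.x parameter names (check your own control file; 0 normally means no cutoff):
```
# maker_bopts.ctl (excerpt)
# Capping evidence depth limits how many overlapping BLAST hits are kept
# per locus, which reduces memory use during evidence clustering.
depth_blastn=20    # EST evidence
depth_blastx=20    # protein evidence
depth_tblastx=20   # alternate-organism EST evidence
```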

—Carson



Re: short scaffolds finish, long scaffolds (almost always) fail

Devon O'Rourke
Much appreciated Carson,
I've submitted a job using the parameters you suggested and will post the outcome. We definitely have two of the three MPI options you described on our cluster (OpenMPI and MPICH2); I'll check on Intel MPI. Happy to ask our cluster admins to use whichever software you prefer (should there be one).
Thanks,
Devon

On Wed, Feb 26, 2020 at 2:54 PM Carson Holt <[hidden email]> wrote:
Try adding a few options right after ‘mpiexec’ in your batch script (this will fix InfiniBand-related segfaults as well as some fork-related segfaults) —> --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 --mca orte_base_help_aggregate 0 --mca btl_openib_want_fork_support 1 --mca mpi_warn_on_fork 0

Also remove the -q from the maker command to get full command lines for subprocesses in the STDERR (this lets you run commands outside of MAKER to test the source of failures if, for example, BLAST or Exonerate is causing the segfault).

Example —>
```
mpiexec --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 --mca orte_base_help_aggregate 0 --mca btl_openib_want_fork_support 1 --mca mpi_warn_on_fork 0 -n 28 /packages/maker/3.01.02-beta/bin/maker -base lu -fix_nucleotides
```
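As a rough illustration of where these options land in a Slurm submission script (module names, resource values, and the stderr filename are hypothetical; only the mpiexec flags and the maker invocation come from the example above):
```
#!/bin/bash
#SBATCH --job-name=maker_lu      # hypothetical resource requests
#SBATCH --nodes=1
#SBATCH --ntasks=28
#SBATCH --mem=120G

module load openmpi maker        # placeholder module names

# -q is omitted so subprocess command lines show up in STDERR,
# and STDERR is captured to its own file for troubleshooting.
mpiexec --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 \
        --mca orte_base_help_aggregate 0 \
        --mca btl_openib_want_fork_support 1 --mca mpi_warn_on_fork 0 \
        -n 28 maker -base lu -fix_nucleotides 2> maker.stderr.log
```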


One alternate possibility is that OpenMPI itself is the problem. I’ve seen a few systems where it has an issue with Perl, and the only way around it is to install your own version of Perl without Perl threads enabled and install MAKER with that version of Perl (then OpenMPI seems to be OK again). If that’s the case, it is often easier to switch to MPICH2 or Intel MPI as the MPI launcher, if either is available, and then reinstall MAKER with that MPI flavor.
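If that route becomes necessary, a minimal sketch of building an unthreaded Perl and reinstalling MAKER against it might look like the following; the Perl version, download URL, and install prefix are assumptions (Perl builds without ithreads unless -Dusethreads is passed):
```
# build a Perl without ithreads (the default when -Dusethreads is not given)
wget https://www.cpan.org/src/5.0/perl-5.30.1.tar.gz
tar xzf perl-5.30.1.tar.gz && cd perl-5.30.1
./Configure -des -Dprefix=$HOME/perl-nothreads
make && make install

# reinstall MAKER using that Perl (MAKER builds from its src/ directory)
cd /path/to/maker/src
$HOME/perl-nothreads/bin/perl Build.PL
./Build install
```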

—Carson 



On Feb 26, 2020, at 12:36 PM, Devon O'Rourke <[hidden email]> wrote:

Thanks very much for the reply Carson,
I've attached a few files from the most recently failed run: the shell script submitted to Slurm, the _opts.ctl file, and the pair of log files generated by the job. The reason there is a 1a and 1b pair of log files is that I had initially set the number of CPUs in the _opts.ctl file to "60", then re-ran after setting it to "28"; both seem to give the same result.
I certainly have access to more memory if needed. I'm using a pretty typical (I think?) cluster that schedules jobs with Slurm on a Lustre file system; it's the main high-performance computing center at our university. I have access to plenty of nodes with about 120-150 GB of RAM and 24-28 CPUs each, as well as a handful of higher-memory nodes with about 1.5 TB of RAM. As I write this, I've submitted a similar Maker job (i.e., the same fasta/gff inputs) requesting 200 GB of RAM over 32 CPUs; if that fails, I can certainly run it again with even more memory.
Appreciate your insights; hope the weather in UT is filled with sun or snow or both.
Devon

Attachments: fail-1a.log.gz, fail-1b.log.gz, run1_maker_opts.ctl, run1_slurm.sh



Re: short scaffolds finish, long scaffolds (almost always) fail

Carson Holt-2
For Intel MPI, export an environment variable right before running MAKER —> "export I_MPI_FABRICS=shm:tcp"

Intel MPI has an InfiniBand segfault issue similar to OpenMPI's when running Perl scripts, but with a different workaround.
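In a batch script, that workaround is just an environment variable set before the launcher; the rest of the line below mirrors the earlier example and is only a sketch:
```
# steer Intel MPI onto shared memory + TCP instead of the InfiniBand fabric
export I_MPI_FABRICS=shm:tcp
mpiexec -n 28 maker -base lu -fix_nucleotides 2> maker.stderr.log
```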

—Carson


Re: short scaffolds finish, long scaffolds (almost always) fail

Devon O'Rourke
Hi Carson,
Two steps forward, one step back, I suppose?
After incorporating the additional MPI-related parameters, the job got further than in previous attempts, but it still failed before completing. It appears that everything was annotated except the six longest scaffolds, along with a few short scaffolds that simply hadn't finished by the time the error brought the whole run down.
I've attached the .log file in hopes that you might find any additional nuggets to help diagnose the problem. Very much appreciate your help.
Devon



Attachment: LUmaker.log.gz (6M)
Re: short scaffolds finish, long scaffolds (almost always) fail

Devon O'Rourke
Hi Carson,
I tried sending this email yesterday but received a notification that the message body was too large; I suspect it was the log file attached to the earlier message. You can see the same file here: https://osf.io/cuxg8/download.
Thanks!


Re: short scaffolds finish, long scaffolds (almost always) fail

Devon O'Rourke
Hi once again Carson,
Our administrators tried installing Maker with a different version of OpenMPI, and the change allowed the job to complete normally. The change was a downgrade from a newer release (3.1.3) to an older one (1.6.5). After the downgrade I needed to make one tweak to the MPI arguments you provided, since v1.6.5 didn't have vader yet; other than that, those options allowed the job to run to completion.
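For the record, the tweak described here presumably amounts to dropping vader from the BTL list, since OpenMPI 1.6.x predates that transport and uses the sm BTL for shared memory instead; a sketch of the adjusted launcher line:
```
# OpenMPI 1.6.5: no vader BTL, so substitute sm for the shared-memory transport
mpiexec --mca btl sm,tcp,self --mca btl_tcp_if_include ib0 \
        --mca orte_base_help_aggregate 0 \
        --mca btl_openib_want_fork_support 1 --mca mpi_warn_on_fork 0 \
        -n 28 maker -base lu -fix_nucleotides
```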
Thanks for your assistance,
Devon

Reply | Threaded
Open this post in threaded view
|

Re: short scaffolds finish, long scaffolds (almost always) fail

Carson Holt-2
I’m glad you were able to make it work.

Thanks,
Carson


On Feb 29, 2020, at 10:27 AM, Devon O'Rourke <[hidden email]> wrote:

Hi once again Carson,
Our administrators reinstalled Maker with a different version of OpenMPI, and that change allowed the job to complete normally. The switch was from a newer release (3.1.3) to an older one (1.6.5). After the downgrade I needed to make one tweak to the MPI arguments you provided, since v1.6.5 doesn't include the vader BTL. Other than that, those options allowed the job to run to completion.
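Roughly speaking, the working launch line ended up along these lines (a sketch only - the process count, paths, and MAKER flags are simply carried over from your earlier example and may not match what we actually ran; the real change was dropping vader from the BTL list, which 1.6.5 predates):
```
# OpenMPI 1.6.5 launch without the vader BTL (shared memory falls back to the
# older "sm" BTL in this release); all values here are illustrative.
mpiexec --mca btl sm,tcp,self \
        --mca btl_tcp_if_include ib0 \
        --mca orte_base_help_aggregate 0 \
        --mca btl_openib_want_fork_support 1 \
        --mca mpi_warn_on_fork 0 \
        -n 28 /packages/maker/3.01.02-beta/bin/maker -base lu -fix_nucleotides
```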
Thanks for your assistance,
Devon

On Fri, Feb 28, 2020 at 7:50 AM Devon O'Rourke <[hidden email]> wrote:
Hi Carson,
I tried sending this email yesterday but received a notification that the message body was too large; I suspect that was due to the log file attached to the earlier message. You can download the same file here: https://osf.io/cuxg8/download.
Thanks!

(previous message below)

....

Two steps forward, one step back, I suppose?
After incorporating the additional MPI-related parameters, the job progressed further than in previous attempts, but it still failed before completing. It appears that all but the six longest scaffolds were annotated (aside from a few short scaffolds that simply weren't finished by the time the error stopped the entire run).
I've attached the .log file in hopes that you might find some additional clues to help diagnose the problem. I very much appreciate your help.
Devon

On Wed, Feb 26, 2020 at 3:18 PM Carson Holt <[hidden email]> wrote:
For Intel MPI, export an environment variable right before running MAKER —> "export I_MPI_FABRICS=shm:tcp"

Intel MPI has a similar infiniband segfault issue as OpenMPI when running Perl scripts, but a different workaround.
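In your batch script that placement would look something like this (the launcher, process count, and maker path are placeholders carried over from the earlier example):
```
# Hypothetical Intel MPI launch: restrict fabrics to shared memory + TCP
# immediately before starting MAKER to avoid the infiniband segfault.
export I_MPI_FABRICS=shm:tcp
mpiexec -n 28 /packages/maker/3.01.02-beta/bin/maker -base lu -fix_nucleotides
```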

—Carson


On Feb 26, 2020, at 1:15 PM, Devon O'Rourke <[hidden email]> wrote:

Much appreciated Carson,
I've submitted a job using the parameters you've suggested and will post the outcome. We definitely have two of the three MPI options you've described on our cluster (OpenMPI and MPICH2); I'll check on Intel MPI. I'm happy to advise my cluster admins to use whichever of them you prefer (should you have a preference).
Thanks,
Devon

On Wed, Feb 26, 2020 at 2:54 PM Carson Holt <[hidden email]> wrote:
Try adding a few options right after ‘mpiexec’ in your batch script (this will fix infiniband-related segfaults as well as some fork-related segfaults) —> --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 --mca orte_base_help_aggregate 0 --mca btl_openib_want_fork_support 1 --mca mpi_warn_on_fork 0

Also remove the -q from the maker command so that the full command lines for subprocesses appear in the STDERR (this lets you rerun individual commands outside of MAKER to pin down the source of a failure if, for example, BLAST or Exonerate is causing the segfault).

Example —>
mpiexec --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 --mca orte_base_help_aggregate 0 --mca btl_openib_want_fork_support 1 --mca mpi_warn_on_fork 0 -n 28 /packages/maker/3.01.02-beta/bin/maker -base lu -fix_nucleotides 
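In a Slurm batch script those options sit on the mpiexec line roughly like this (the #SBATCH values, module names, and paths below are placeholders - use whatever your site actually provides):
```
#!/bin/bash
# Hypothetical Slurm wrapper for the mpiexec example above; resource values,
# module names, and paths are illustrative only.
#SBATCH --job-name=maker_run1
#SBATCH --nodes=1
#SBATCH --ntasks=28
#SBATCH --mem=120G
#SBATCH --time=7-00:00:00

module load openmpi maker   # site-specific; load whatever provides mpiexec and maker

mpiexec --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 \
        --mca orte_base_help_aggregate 0 --mca btl_openib_want_fork_support 1 \
        --mca mpi_warn_on_fork 0 \
        -n 28 /packages/maker/3.01.02-beta/bin/maker -base lu -fix_nucleotides
```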


One alternate possibility is that OpenMPI itself is the problem. I've seen a few systems where it has an issue with Perl, and the only way around it is to install your own version of Perl without threads enabled and install MAKER with that version (then OpenMPI seems to be okay again). If that's the case, it is often easier to switch to MPICH2 or Intel MPI as the MPI launcher, if they are available, and reinstall MAKER with that MPI flavor.
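If you do end up going the custom-Perl route, the general shape is something like this (the Perl version, install prefix, and MAKER source path are placeholders, and the MAKER build prompts can vary between versions):
```
# Sketch: build a private Perl with threads disabled, then reinstall MAKER
# against it. Versions and paths are placeholders.
wget https://www.cpan.org/src/5.0/perl-5.26.3.tar.gz
tar xzf perl-5.26.3.tar.gz && cd perl-5.26.3
./Configure -des -Dprefix=$HOME/perl-nothreads -Uusethreads
make && make install

cd /path/to/maker/src
$HOME/perl-nothreads/bin/perl Build.PL    # enable MPI support when prompted
./Build install
```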

—Carson 



On Feb 26, 2020, at 12:36 PM, Devon O'Rourke <[hidden email]> wrote:

Thanks very much for the reply Carson,
I've attached a few files from the most recently failed run: the shell script submitted to Slurm, the _opts.ctl file, and the pair of log files generated by the job. The reason there is a 1a/1b pair of log files is that I had initially set the number of cpus in the _opts.ctl file to "60", then re-ran after setting it to "28"; both runs seem to end the same way.
I certainly have access to more memory if needed. I'm using a pretty typical (I think?) cluster that schedules jobs with Slurm on a Lustre file system - it's the main high-performance computing center at our university. I have access to plenty of nodes with about 120-150 GB of RAM and 24-28 cpus each, as well as a handful of higher-memory nodes with about 1.5 TB of RAM. As I write this email, I've submitted a similar Maker job (i.e., the same fasta/gff inputs) requesting 200 GB of RAM over 32 cpus; if that fails, I can certainly run again with even more memory.
Appreciate your insights; hope the weather in UT is filled with sun or snow or both.
Devon

On Wed, Feb 26, 2020 at 2:10 PM Carson Holt <[hidden email]> wrote:
If running under MPI, the reason for a failure may be much further back in the STDERR (failures tend to snowball into other failures, so the initial cause is often way back). If you can capture the STDERR and send it, that would be the most informative. If it's memory, you can also set all the blast depth parameters in maker_bopts.ctl to a value like 20.
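For example, in maker_bopts.ctl that would look roughly like this (double-check the parameter names against your own control file - these are from memory, and 0 normally means no limit):
```
# Hypothetical maker_bopts.ctl edit: cap the number of BLAST hits kept per
# locus to reduce memory use.
depth_blastn=20   # EST evidence (blastn)
depth_blastx=20   # protein evidence (blastx)
depth_tblastx=20  # altEST evidence (tblastx)
```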

—Carson





_______________________________________________
maker-devel mailing list
[hidden email]
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org