Problem with OpenFabrics and infiniband

Problem with OpenFabrics and infiniband

UMD Bioinformatics
Hello,

I’ve had my IT folks install MAKER on our cluster at UMD. MAKER segfaults when I run it on InfiniBand nodes, but not on gigE nodes. According to the logs this appears to be an issue with forks, but I’m not sure how to fix it. I would simply use the gigE nodes, but we are in the process of moving everything to InfiniBand, so I’ll need to address this issue at some point. I’ve attached the error log from the MPI run as well as commentary from my HPCC team.

IT suggestions

If you look at the top of the error log for the problematic job, it clearly
warns of an issue with calling fork() within the Open MPI/OpenFabrics framework.

In particular, the use of the fork system call is only partially supported
in the OpenFabrics software (the drivers, etc. for the InfiniBand
connections). See e.g.
http://www.open-mpi.org/faq/?category=openfabrics#ofa-fork
for more information, especially the paragraphs starting with the
red-highlighted sentence "it does not mean that your fork()-calling
application is safe". (The kernel, Open MPI version, and OFED version are
sufficiently recent that there is _some_ fork support.)

The fact that the job runs over gigE but not IB, in conjunction with the
warning from openmpi, strongly suggests that this is the issue that you are 
encountering. I suspect that maker touches registered memory before the fork,
which would result in a segfault (matching what was observed).

You can try adding the arguments
--mca mpi_warn_on_fork 0 
to the mpirun command, just in case the crash was somehow caused by openmpi's
warning, but I would not hold out much hope for that.

###UPDATE### This does not fix the problem.


Basically, it looks like maker uses some system calls like fork in a manner
which is incompatible with the current OpenFabrics software, and thus will
not work with infiniband. This situation is likely to remain until either
maker changes to be compatible with OFED, or OFED's support for the fork
system call is broadened.
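
To see which fork-related settings a given Open MPI build exposes, something like the following sketch may help. It only assumes that ompi_info (shipped with Open MPI) is on the PATH; the function name show_fork_params is made up for illustration. Depending on the version and on whether openib support was built in, the output may include mpi_warn_on_fork and btl_openib_want_fork_support.

```shell
# Hypothetical helper: list fork-related MCA parameters, if any are reported.
show_fork_params() {
    if command -v ompi_info >/dev/null 2>&1; then
        # ompi_info --all dumps every parameter; keep only lines mentioning fork.
        ompi_info --all 2>/dev/null | grep -i fork \
            || echo "no fork-related parameters reported"
    else
        echo "ompi_info not found on PATH"
    fi
}

show_fork_params
```

This only reports configuration. It does not make a fork()-calling application safe over InfiniBand.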




_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Attachment: maker_error_openfabrics.txt (34K)
Re: Problem with OpenFabrics and infiniband

Carson Hinton Holt
It’s a little more complicated than that.  MAKER is written in Perl, and Perl doesn’t give me the low-level control over memory access that a language like C would.  All I get is Perl’s standard implementation of fork.  So it’s not really a matter of MAKER changing; it would be a matter of changing Perl itself (which I have no power over, and which I don’t think will be changing anytime soon).

For now you just have to add this flag to OpenMPI when running MAKER with mpiexec:  -mca btl ^openib

Example :
mpiexec -mca btl ^openib -n 20 maker
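
The same exclusion can probably also be set through the environment, since Open MPI reads MCA parameters from variables named OMPI_MCA_<parameter>. The sketch below assumes a POSIX-style shell and reuses the process count from the example above. It prints the launch line instead of running it, so it is harmless on a node without MAKER installed.

```shell
# Exclude the openib BTL for every Open MPI job started from this shell.
# Single quotes keep the caret literal in shells that treat ^ specially.
export OMPI_MCA_btl='^openib'

# Print the launch line rather than running it; drop the echo to start the job.
echo mpiexec -n 20 maker
```

With openib excluded, Open MPI falls back to its remaining transports (typically TCP and shared memory), so expect the job to run without native InfiniBand performance.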


Thanks,
Carson


From: UMD Bioinformatics <[hidden email]>
Date: Thursday, February 27, 2014 at 9:46 AM
To: <[hidden email]>
Subject: Problem with OpenFabrics and infiniband

Re: Problem with OpenFabrics and infiniband

UMD Bioinformatics
Hi Carson,

Thanks, that fixed the issue.

Cheers
Ian

On Feb 27, 2014, at 1:09 PM, Carson Holt <[hidden email]> wrote:
