MPI selection


MPI selection

admin
Hi, we now have three MPI implementations installed on our cluster: OpenMPI,
MPICH, and MVAPICH2.  Since we have InfiniBand, MVAPICH2 works best with
MPI test programs.  MPICH should support InfiniBand too, but we are
currently trying to resolve some segfaults with it.

On our cluster we have a ~/.mpi-selection file that allows users to pick
which MPI installation to use; it sets the appropriate PATH and
LD_LIBRARY_PATH variables.
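For reference, the selection mechanism amounts to something like the
following (the /opt/mvapich2 prefix is an illustrative assumption, not our
actual layout):

```shell
# Sketch of what an MPI-selection file typically does: prepend the chosen
# install's bin and lib directories so mpicc/mpiexec/libmpi.so resolve there.
MPI_HOME=/opt/mvapich2    # assumed install prefix
export PATH="$MPI_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$MPI_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# confirm the selected install is first on the PATH
echo "$PATH" | grep -q "$MPI_HOME/bin" && echo "selection applied"
```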

I am looking through the MAKER MPI instructions, and it seems that a
specific mpicc and mpi.h must be chosen during installation.  So if MAKER
was originally installed with MPICH, would I have to reinstall it if users
want to use MVAPICH2?  Or is there a config file somewhere I can update so
I don't have to reinstall MAKER?  Or does nothing need to be done, so we
can rely on the PATH and LD_LIBRARY_PATH variables pointing to the correct
mpicc and libmpi.so (with mpi.h in the include directory)?

Thanks

_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Re: MPI selection

Carson Holt-2
The libraries used by MVAPICH2, Intel MPI, and OpenMPI to access InfiniBand have a known bug. For performance reasons, InfiniBand libraries use registered memory in a way that makes it impossible to make system calls to external programs under MPI (doing so results in segfaults). MAKER has to call out to external programs like BLAST, Exonerate, etc., so it triggers this bug.

The InfiniBand bug is well known and unfortunately will not be fixed, because fixing it would cause InfiniBand to lose some advertised features like remote direct memory access. As a workaround, OpenMPI and Intel MPI allow you to disable the InfiniBand libraries via command-line flags and use IP over InfiniBand (IPoIB) instead (i.e. they let you drop InfiniBand features on demand so that your code will run). However, MVAPICH2 does not provide the same option. As a result you cannot use MAKER, or any MPI program that makes system calls to external programs, with MVAPICH2 (it results in segfaults). But you can use all the other MPI flavors with the appropriate flags detailed below:

#For OpenMPI, use as follows (the example assumes ib0 is your IP-over-InfiniBand adapter)
export LD_PRELOAD=/path/to/openmpi/libmpi.so
mpiexec --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 --mca btl_openib_want_fork_support 1 --mca mpi_warn_on_fork 0 maker

#For Intel MPI, set these environment variables before launch
export I_MPI_FABRICS='shm:tcp'
export I_MPI_HYDRA_IFACE='ib0'
mpiexec maker

#For MPICH nothing is needed, as the InfiniBand libraries are always disabled; you can still explicitly tell it to use the ib0 adapter for communication
mpiexec -iface ib0 maker
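If you want a single entry point for users, a small wrapper can print the
right launch line per flavor. This is a hypothetical helper (not part of
MAKER); the ib0 adapter and the flag sets simply mirror the examples above:

```shell
# Print the MAKER launch line for a given MPI flavor; MVAPICH2 is
# rejected because it cannot disable its InfiniBand libraries.
launch_line() {
  case "$1" in
    openmpi) echo "mpiexec --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 --mca btl_openib_want_fork_support 1 --mca mpi_warn_on_fork 0 maker" ;;
    intel)   echo "I_MPI_FABRICS=shm:tcp I_MPI_HYDRA_IFACE=ib0 mpiexec maker" ;;
    mpich)   echo "mpiexec -iface ib0 maker" ;;
    *)       echo "unsupported flavor: $1" >&2; return 1 ;;
  esac
}

launch_line mpich   # prints: mpiexec -iface ib0 maker
```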

—Carson



Re: MPI selection

admin

Well that stinks!  Maybe that's why we got such a good deal on
new-old-stock InfiniBand equipment!  Still, it has allowed us to use the
full speed of our NFS RAIDs, which has been nice.  I will try using ib0;
the speed is still about 10 Gb/s, but I was under the impression that
IPoIB would cause packet loss or other problems...

Thanks for clearing that up.  So is there a fabric/protocol you would
recommend for clusters running MAKER?



Re: MPI selection

Carson Holt-2
MAKER does not really move a lot of data over MPI; it's just passing around command lines and small variables, so not getting full InfiniBand performance will not hurt you. I doubt you'll see any issues using ib0. For MPI flavor, I get the best performance with Intel MPI, followed by OpenMPI. Overall you will find that MAKER is IO bound rather than CPU or communications bound, so pointing it at your best-performing network storage will be the greatest performance factor (if you have Lustre storage, for example, point it there). Pull back on job size and count if other users have issues accessing the disk (too many jobs can bring NFS to its knees). The one suggestion I have as far as job size is to keep jobs under 200 CPU cores; above that, you will get better performance by splitting up datasets and submitting multiple jobs. Also, MAKER keeps a log of its progress, so you can kill jobs or restart failed ones and they pick up right where they left off.
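As a rough illustration of splitting a dataset, standard shell tools are
enough; the toy FASTA and the chunk count of 2 here are purely
illustrative stand-ins:

```shell
# create a tiny multi-FASTA as a stand-in for a real genome
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n>seq4\nCCGG\n' > genome.fasta

# round-robin FASTA records into chunk_0.fasta and chunk_1.fasta,
# so each chunk can be handed to its own MAKER job
awk '/^>/{n++} {print > ("chunk_" (n % 2) ".fasta")}' genome.fasta

grep -c '^>' chunk_0.fasta   # prints 2 (each chunk gets 2 of the 4 records)
```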

—Carson




Re: MPI selection

admin
Thanks Carson, one other question though.

When you mentioned keeping jobs under 200 CPU cores, did you mean logical
CPUs as presented to the OS, or physical cores?  We are using
hyperthreading, which presents more CPUs to the OS, so we have about 256
logical CPUs available.  Some of our applications are better optimized
for this, so I keep it enabled.

So suppose we keep hyperthreading enabled: should we specify in the
machine list file that mpiexec uses only 128 of the logical cores?  We
have noticed that with all 256 hyperthreaded cores the load can get high,
although everything still works great.

Thanks


Chandler / Systems Administrator
Arizona Genomics Institute
www.genome.arizona.edu



Re: MPI selection

Carson Holt-2
Yes, you can oversubscribe and match the hyperthread count (you will need ~1 GB of RAM per process though). But you should still keep the overall number of MPI processes given to mpiexec below ~200. This is because of the way I structured the MPI controller in MAKER: there is one manager process, and all the other processes are workers. If there are too many workers, they can overwhelm the manager with communication and scaling efficiency drops. Starting another MAKER job launches a new manager with its own workers, so you get around the communication bottleneck by launching multiple large MAKER jobs, and scaling efficiency returns to near linear. You can run multiple jobs until you start hitting IO bottlenecks (MAKER generates high IOPS but not necessarily high bandwidth).
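For example, two mid-size jobs instead of one oversized one might be
launched along these lines (illustrative only; the file names, core
counts, and the -genome/-base options are assumptions about a typical
invocation, not a verified recipe):

```shell
# two ~200-core MAKER jobs, each with its own manager process and its
# own output base, run concurrently on disjoint dataset halves
mpiexec -n 196 maker -genome part1.fasta -base part1 &
mpiexec -n 196 maker -genome part2.fasta -base part2 &
wait   # both jobs log progress and can be restarted where they left off
```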

—Carson



