Parallelization without MPI.

Parallelization without MPI.

trs38
I have had difficulty getting MPI to co-operate with our cluster.
However, I read that you:

"Can also start MAKER multiple times and get parallelization without
MPI. Subsequent MAKER instances will detect already running instances
and integrate seamlessly"

Since this is discussed only briefly in the tutorial, I just wanted to
confirm that it is true and to check whether there is any particular
advice on how to go about it.  I have been looking at my log files
and see that quite a few contigs are failing, and I wondered whether
this was because some special technique was needed.

Many thanks

Re: Parallelization without MPI.

trs38
My datastore.log shows the same contig as FINISHED many times;
presumably this means that something is going wrong.  How does the
detection work?  Is it at the file system level, or does it require
that the processes be running on the same physical server?

Many thanks.

Re: Parallelization without MPI.

Carson Hinton Holt
First, which version of MAKER are you using?

Failures can happen for multiple reasons.  Check the STDERR output and look for messages with the "ERROR" tag accompanied by a longer message.

Having multiple FINISHED lines is not a problem, because finished contigs do not get rerun.  It is just an indication that they were detected as finished at the start, a message was printed, and the process moved on.  However, seeing multiple messages rather than a single one when running several processes this way is indicative of an older version of MAKER, hence my first question.

In general, you should be able to easily start as many as 20 processes in the same directory without MPI.  Beyond that you will still get parallelization, but at much lower efficiency than MPI, primarily because of IO overhead.  I often use a mixed strategy, starting multiple MPI processes in the same directory, so it takes advantage of both the MPI and the non-MPI parallelization.  As proof of principle, and to test the limits of this technique, I have successfully run MAKER on over 1,700 processors simultaneously by starting multiple jobs of 64 processors each.
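
For reference, the non-MPI approach amounts to nothing more than launching several maker processes in the same working directory.  A minimal sketch, assuming the genome and control files are already set up (the directory path and log file names are placeholders):

    cd /path/to/annotation_dir        # directory containing the *.ctl files
    for i in 1 2 3 4; do
        maker > maker_$i.log 2>&1 &   # each instance claims unfinished contigs
    done
    wait                              # block until all four instances exit

Each instance skips contigs that another instance has already locked or finished, which is what makes the seamless integration work.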

--Carson

Re: Parallelization without MPI.

trs38
Thank you very much for the helpful reply.  I am using MAKER 2.09.
MPI is working when all processes run on the same server, so your
hybrid strategy would work well for me.

However, I've been having some trouble with these errors:

ERROR: Cannot refresh the lock as it has apparently been broken

FATAL ERROR

These appear when attempting to start multiple mpi_maker groups.  I
have occasionally been able to do so, but I can identify no pattern as
to when and how this is possible.  Would you have any advice for that?

Thanks again

Re: Parallelization without MPI.

Carson Hinton Holt
To a certain degree you can ignore the error "ERROR: Cannot refresh the lock as it has apparently been broken".  MAKER will automatically retry any failures later.  Just make sure to set the retry value in the control files.

The error happens because of a delay in your NFS.  This can happen when you have async set as part of the export/mount options.  Basically, your NFS mount says the lock was created and returns success even though it really wasn't created, so another process gets the lock at the same time.  I have MAKER double-check lock ownership for this very reason (I don't trust the system).  This way MAKER won't step on another process's toes and cause downstream problems.  The error doesn't affect the data or the processing of the data.

Basically, the error means "I thought it was safe for me to process this contig because the system said I have the lock, but apparently another process got the lock as well, so I will move on to the next contig and let the other process take this contig.  I will come back later and check that the other process either finished the contig or released the lock."
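
As a sketch of the retry setting: in MAKER 2.x it appears in maker_opts.ctl, though the exact option name can vary between releases, so check the comments in your generated control files:

    # maker_opts.ctl
    tries=2  # number of times to try a contig if there is a failure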

--Carson

Re: Parallelization without MPI.

trs38
OK, thank you for the information.  Is the FATAL ERROR separate?  They
seem to be correlated, and the FATAL ERROR seems pretty catastrophic.
Where would I look to try to get more information on the cause of the
fatal error?

Cheers.


Re: Parallelization without MPI.

trs38
One further point: these errors occur only when running MPI MAKER,
not when running standalone MAKER, in case that sheds any light on
anything.
Best regards


Re: Parallelization without MPI.

Carson Hinton Holt
In reply to this post by trs38
FATAL ERROR is issued because the lock error causes processing of the contig to fail via a Perl die statement, so it is the same error, not a different one.  The other process that was also running the same contig keeps running, and the MAKER process that experienced the failure will always go back at the end and retry contigs that died, performing any necessary cleanup.  In MAKER 1.0 these types of errors would kill the process, but there is now a process manager that captures the failures and relaunches the contigs at the end of a run.

To see all contigs that died, look at master_datastore_index.log.  It will contain FAILED entries, but you should see RETRY further down the log.  Contigs that continue to fail beyond the specified number of retries will be labeled DIED_SKIPPED_PERMANENT and will produce an individual FASTA file for just that contig in the output directory; you can then run that contig by itself for debugging using that file.  If a contig fails for system-level reasons (in this case you are experiencing an IO-related error) and is later listed as FINISHED in the log, then the retry was successful and you don't need to worry about it.  Because system errors are not consistent or predictable, simply retrying is often sufficient to resolve them.
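
A quick way to audit the log (a sketch; the paths assume MAKER's usual <genome>.maker.output naming, so substitute your own genome name):

    cd <genome>.maker.output
    grep -c FINISHED <genome>_master_datastore_index.log
    grep -E 'FAILED|RETRY|DIED_SKIPPED_PERMANENT' <genome>_master_datastore_index.log

The FINISHED count should eventually match your contig count, and any DIED_SKIPPED_PERMANENT entries are the ones worth debugging individually.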

--Carson

Re: Parallelization without MPI.

Carson Hinton Holt
In reply to this post by trs38
It's an IO error; IO is used much more heavily by MPI because multiple processes are accessing the hard drive simultaneously, which is why you won't normally see it with a standalone process.  I assume you are writing to NFS, which is the Achilles heel of high-throughput computation.  Depending on how NFS is set up, in extreme cases the system can say a file was written correctly, only to then say the file doesn't exist for several seconds after the file was supposedly created.  These types of issues become more pronounced as IO usage on NFS increases; MAKER checks for them, hence the error you are seeing.  The error will be caught and retried, so you should be able to ignore it.  Just check the master_datastore_index.log at the end for DIED_SKIPPED_PERMANENT statements, and also verify there are FINISHED statements for all your contigs.  You can also run standalone maker after your job finishes, if you want to be extra sure.  It will auto-detect all finished contigs and skip them, thereby confirming that they were successfully finished.
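
That final verification pass is just a rerun in the same directory; a minimal sketch, assuming the original control files are unchanged (the directory path is a placeholder):

    cd /path/to/annotation_dir   # same directory and *.ctl files as the MPI run
    maker 2> verify.log          # contigs already marked FINISHED are skipped

If it runs through without picking up any new work, everything finished.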

Thanks,
Carson

Re: Parallelization without MPI.

trs38
Thanks as ever for the help.

I think the problem may be occurring before it even starts processing
any contigs, and that this may make it more troublesome.

E.g. this is my entire output:

trs38@node6:~/datavel$ mpiexec.hydra -hosts node6 -n 10 mpi_maker -c 2

ERROR: Cannot refresh the lock as it has apparently been broken

FATAL ERROR

At this point all processes cease to use any CPU and MAKER does not
proceed.  Running the same command continues to yield the same
response; however, occasionally it will work on one host or another
and will then run happily through contigs.  It's just getting it
started that is the issue.

I'm managing to make it limp along, but it's frustrating not being
able to use the full resources available.

Thank you very much for all the assistance, which has been much
appreciated.



Re: Parallelization without MPI.

trs38
In reply to this post by Carson Hinton Holt
I'm told that we do not use async, so that's not the particular
problem.


Re: Parallelization without MPI.

Carson Hinton Holt
In reply to this post by trs38
Even without async, NFS can still cause issues depending on the setup, and everything here points to IO issues.

There is also one other issue that may be affecting your runs and can interfere with file locking.  It is related to the new hydra process manager in MPI.  Hydra replaced MPD in the latest MPICH2 release because it is more fault tolerant; it's basically impossible to kill.  Unfortunately, hydra appears to be too fault tolerant, as it can sometimes maintain processes that fail or that you kill.  The result is a series of defunct processes that can hold locks and thereby keep subsequent processes from accessing needed files.  To see whether hydra is holding onto defunct processes, run ps -AF | grep mpi_maker and look for the 'defunct' label after the process.
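
What that check might look like (a sketch; the PID columns are made up, but <defunct> is the label Linux ps prints for zombie processes):

    $ ps -AF | grep mpi_maker
    trs38  12345  12301  0 ... [mpi_maker] <defunct>

Defunct entries cannot be killed directly; they linger until their parent (here, hydra) is terminated.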

Side note: don’t use MVAPICH2 with MAKER, use MPICH2.

Also, try starting one maker job on one node, wait for it to start processing, and then start another job on another node.  There may be issues with your IO causing it to slow during initialization, and by initializing only one job at a time you can avoid the issue.  The low CPU usage also makes me think that MAKER is not simply frozen: a real failure in MPI actually causes a spike in CPU usage, because a blocking message pass in MPI uses 100% CPU when there is no response between nodes.  A frozen process would therefore show 100% CPU usage and no output.
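
A staggered launch might look like this (a sketch; the second hostname and the sleep interval are placeholders, with sleep standing in for watching the logs until the first group is processing contigs):

    mpiexec.hydra -hosts node6 -n 10 mpi_maker -c 2 &
    sleep 300   # crude stand-in for checking that contigs have started
    mpiexec.hydra -hosts node7 -n 10 mpi_maker -c 2 &
    wait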

--Carson



On 3/11/11 10:32 AM, "trs38" <trs38@...> wrote:

Thanks as ever for the help.

I think the problem may be occurring before it even starts processing
any contigs, which may make it more troublesome.

E.g. this is my entire output:

trs38@node6:~/datavel$ mpiexec.hydra -hosts node6 -n 10 mpi_maker -c 2

ERROR: Cannot refresh the lock as it has apparently been broken

FATAL ERROR

At this point all processes cease to use any CPU and MAKER does not
proceed.  Running the same command continues to yield the same
response; however, occasionally it will work on one host or another and
then run happily through the contigs.  It's just getting it started
that is the issue.

I'm managing to make it limp along but it's frustrating not being able
to use the full resources available.

But thank you very much for all the assistance which has been much
appreciated.


On Mar 11, 4:36 pm, Carson Holt <carson.h...@...> wrote:
> It's an IO error; IO is hit much harder under MPI because multiple processes are accessing the hard drive simultaneously, which is why you won't normally see it with a standalone process.  I assume you are writing to NFS, which is the Achilles heel of high-throughput computation.  Depending on how NFS is set up, in extreme cases the system can say a file was written correctly, only to then say the file doesn't exist for several seconds after it was supposedly created.  These types of issues become more pronounced as IO usage increases on NFS, and MAKER checks for them, hence the error you are seeing.  The error will be caught and retried, so you should be able to ignore it.  Just check the master_datastore_index.log at the end for DIED_SKIPPED_PERMANENT statements and also verify there are FINISHED statements for all your contigs.  You can also run standalone MAKER after your job finishes if you want to be extra sure; it will auto-detect all finished contigs and skip them, thereby confirming that they were successfully completed.
>
> Thanks,
> Carson
>
> On 3/11/11 8:59 AM, "trs38" <tr...@...> wrote:
>
> One further point: these errors only occur when running mpi_maker,
> not when running standalone MAKER, in case that sheds any light on
> anything.
> Best regards
>
> On Mar 11, 12:44 pm, trs38 <tr...@...> wrote:
>
> > OK, thank you for the information.  Is the FATAL ERROR a separate
> > issue?  The two seem to be correlated, and the FATAL ERROR seems pretty
> > catastrophic.  Where would I look for more information on the cause of
> > the fatal error?
>
> > Cheers.
>
> > On Mar 10, 5:28 pm, Carson Holt <carson.h...@...> wrote:
>
> > > To a certain degree you can ignore the error "ERROR: Cannot refresh the lock as it has apparently been broken".  MAKER will automatically retry any failures later.  Just make sure to set the retry value in the control files.
>
> > > The error happens because of a delay in your NFS.  This can happen when you have async set as part of the export/mount options.  Basically, your NFS mount says the lock was created and returns success even though it really wasn't, so another process gets the lock at the same time.  I have MAKER double-check lock ownership for this very reason (I don't trust the system).  This way MAKER won't step on another process's toes and cause downstream problems.  The error doesn't affect the data or the processing of the data.
>
> > > Basically the error means "I thought it was safe for me to process this contig because the system said I have the lock, but apparently another process got the lock as well, so I will move onto the next contig and let the other process take this contig.  I will come back later and check that the other process either finished the contig or released the lock."
>
> > > --Carson
>
> > > On 3/10/11 10:17 AM, "trs38" <tr...@...> wrote:
>
> > > Thank you very much for the helpful reply. I am using Maker 2.09.  MPI
> > > is working when all processes run on the same server so your hybrid
> > > strategy would work well for me.
>
> > > However I've been having some trouble with these errors:
>
> > > ERROR: Cannot refresh the lock as it has apparently been broken
>
> > > FATAL ERROR
>
> > > These errors appear when attempting to start multiple mpi_maker
> > > groups.  I have occasionally been able to do so but can identify no
> > > pattern as to when and how this is possible.  Would you have any
> > > advice for that?
>
> > > Thanks again
> > > On Mar 9, 3:52 pm, Carson Holt <carson.h...@...> wrote:
>
> > > > First, which version of MAKER are you using?
>
> > > > Failures can be for multiple reasons.  Check the STDERR output and look for messages with the "ERROR" tag and a longer message.
>
> > > > Having multiple FINISHED lines is not a problem, because they do not get rerun.  It is just an indication that they were detected as finished at the start, a message was printed, and then the process moved on.   However, having multiple rather than a single message when running several processes this way  is indicative of an older version of MAKER, hence my first question.
>
> > > > In general, you should be able to easily start as many as 20 processes in the same directory without MPI.  Beyond that you will still get parallelization, but at much less efficiency than MPI, primarily because of IO overhead.  I often use a mixed strategy, starting multiple MPI processes in the same directory, so it takes advantage of both the MPI and the non-MPI parallelization.  As proof of principle and to test the limits of this technique, I have successfully run MAKER on over 1,700 processors simultaneously by starting multiple jobs of 64 processors each.
>
> > > > --Carson
>
> > > > On 3/9/11 8:13 AM, "trs38" <tr...@...> wrote:
>
> > > > My datastore.log shows the same contig as FINISHED many times,
> > > > presumably this means that something is going wrong.  How does the
> > > > detection work?  Is it at the file system level, or does it require
> > > > that the processes be running on the same physical server?
>
> > > > Many thanks.
>
> > > > On Mar 9, 12:50 pm, trs38 <tr...@...> wrote:
>
> > > > > I have had difficulty getting MPI to co-operate with our cluster.
> > > > > However I read that you:
>
> > > > > "Can also start MAKER multiple times and get parallelization without
> > > > > MPI. Subsequent MAKER instances will detect already running instances
> > > > > and integrate seamlessly"
>
> > > > > Since this is discussed so briefly in the tutorial I just wanted to
> > > > > confirm that it is true and check whether there is any particular
> > > > > advice for how to go about it.  I have been looking at my log files
> > > > > and see that quite a few contigs are failing and I wondered if this
> > > > > was because some special technique was needed.
>
> > > > > Many thanks

_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org