RunWorkflow not identifying Aborted SGE jobs?

RunWorkflow not identifying Aborted SGE jobs?

Aaron Gussman
Does anyone have experience with RunWorkflow not identifying that a
job has been killed by SGE?

The workflow event log is as follows:

I~~~prolog starting
I~~~htc id   sge id[.task id]   date   message   hostname
S~~~49393901~~~8668777~~~Wed Feb  9 23:46:27 EST 2011~~~job started on workflow for idxseq~~~sge134.be-md.ncbi.nlm.nih.gov
I~~~prolog ending
I~~~ wrapper script starting job
I~~~ Job Process id is 8641
I~~~epilog starting
I~~~htc id   sge id[.task id]   date   message   hostname
T~~~49393901~~~8668777~~~Thu Feb 10 02:46:29 EST 2011~~~job finished on workflow for idxseq~~~sge134.be-md.ncbi.nlm.nih.gov
I~~~epilog ending
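
For anyone who wants to inspect a log like this programmatically, here is a minimal Python sketch that splits each record on the `~~~` delimiter. The column names are taken from the header line in the log itself (htc id, sge id, date, message, hostname); the `Event` type and the `parse_event_log` function are hypothetical names made up for illustration and are not part of workflow or Ergatis. It also rejoins records that a mail client has wrapped across two lines.

```python
import re
from typing import List, NamedTuple, Optional

DELIM = "~~~"
# A record starts with a single capital letter followed by the delimiter.
RECORD_START = re.compile(r"^[A-Z]~~~")


class Event(NamedTuple):
    code: str                 # first field, e.g. I, S or T
    htc_id: Optional[str]
    sge_id: Optional[str]
    date: Optional[str]
    message: str
    hostname: Optional[str]


def parse_event_log(text: str) -> List[Event]:
    """Parse event.log text into Event records (illustrative sketch only)."""
    records: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            # Continuation of a record wrapped by a mail client.
            records[-1] += " " + line

    events = []
    for rec in records:
        parts = [p.strip() for p in rec.split(DELIM)]
        if len(parts) == 6:
            events.append(Event(*parts))
        else:
            # Short records such as "I~~~prolog starting" carry only a message.
            events.append(Event(parts[0], None, None, None,
                                DELIM.join(parts[1:]), None))
    return events


SAMPLE_LOG = """\
I~~~prolog starting
I~~~htc id   sge id[.task id]   date   message   hostname
S~~~49393901~~~8668777~~~Wed Feb  9 23:46:27 EST 2011~~~job started on
workflow for idxseq~~~sge134.be-md.ncbi.nlm.nih.gov
I~~~prolog ending
I~~~ wrapper script starting job
I~~~ Job Process id is 8641
I~~~epilog starting
I~~~htc id   sge id[.task id]   date   message   hostname
T~~~49393901~~~8668777~~~Thu Feb 10 02:46:29 EST 2011~~~job finished
on workflow for idxseq~~~sge134.be-md.ncbi.nlm.nih.gov
I~~~epilog ending
"""

for event in parse_event_log(SAMPLE_LOG):
    print(event.code, event.sge_id, event.message)
```

Treating any line that does not begin with a code and `~~~` as a continuation is a guess that fits the log as posted here; a log with multi-line messages of another shape would need a different rule.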

I believe the job ran too long and was automatically killed by SGE.

I'm using workflow 3.1.4.
I recall Anup mentioning at one point that some versions of
RunWorkflow would keep running unless there was a specific failure
notification in the event.log.  Could that be what's going on here?

Thanks for any advice or assistance,
Aaron

_______________________________________________
Ergatis-users mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/ergatis-users

Re: RunWorkflow not identifying Aborted SGE jobs?

Mahurkar, Anup
Aaron,

In this case wf-3.1.4 should have picked up the unusual termination of the job (the 'T' line) without an 'F' line. Did this not happen?
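
As a rough way to check a log by hand for the condition described above (a 'T' record with no 'F' record), something like the following Python sketch could be used. It is not how RunWorkflow itself decides that a job failed; the function name `terminated_without_finish` is hypothetical, and the only rule it encodes is the one stated in this message.

```python
import re

# A record starts with a single capital letter followed by "~~~".
RECORD = re.compile(r"^([A-Z])~~~")


def terminated_without_finish(log_text: str) -> bool:
    """Return True if the log has a 'T' record but no 'F' record.

    Assumption: each record begins with a one-letter code and '~~~',
    as in the event.log posted in this thread. Lines without such a
    prefix (for example wrapped continuation lines) are ignored.
    """
    codes = set()
    for line in log_text.splitlines():
        match = RECORD.match(line.strip())
        if match:
            codes.add(match.group(1))
    return "T" in codes and "F" not in codes


example = "\n".join([
    "I~~~prolog starting",
    "S~~~49393901~~~8668777~~~Wed Feb  9 23:46:27 EST 2011~~~job started on workflow for idxseq~~~sge134.be-md.ncbi.nlm.nih.gov",
    "I~~~epilog starting",
    "T~~~49393901~~~8668777~~~Thu Feb 10 02:46:29 EST 2011~~~job finished on workflow for idxseq~~~sge134.be-md.ncbi.nlm.nih.gov",
    "I~~~epilog ending",
])
print(terminated_without_finish(example))
```

Running this over each event.log in a pipeline's output directory would at least show which jobs match the pattern while RunWorkflow is still reporting them as running.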

-----Original Message-----
From: Aaron Gussman [mailto:[hidden email]]
Sent: Thursday, February 10, 2011 12:10 PM
To: [hidden email]
Subject: [Ergatis-users] RunWorkflow not identifying Aborted SGE jobs?

Re: RunWorkflow not identifying Aborted SGE jobs?

Aaron Gussman
Hi Anup,
  The RunWorkflow process for pipeline.xml is still running and is still
writing marshalling info to pipeline.xml.log.

  Are there any other log files I should check?

Thanks,
Aaron

On Thu, Feb 10, 2011 at 1:48 PM, Mahurkar, Anup
<[hidden email]> wrote:

> Aaron,
>
> In this case wf-3.1.4 should have picked up the unusual termination of the job (the 'T' line) without an 'F' line. Did this not happen?

Re: RunWorkflow not identifying Aborted SGE jobs?

Aaron Gussman
Hi Anup, ergatis users,
It happened again: the SGE job was aborted, but RunWorkflow
didn't pick up on it.

The event.log is attached.  It has a 'T' line but no 'F' line.

In the sge_job.sh, all references to workflow are the 3.1.4 version.
I can pass this along as well, if it would be useful.

Is it possible workflow is misconfigured?

Thanks for any advice,
Aaron

On Thu, Feb 10, 2011 at 2:40 PM, Aaron Gussman <[hidden email]> wrote:

> Hi Anup,
>  The pipeline.xml RunWorkflow is still running and reporting
> marshalling info to pipeline.xml.log.
>
>  Are there any other log files I should check?
>
> Thanks,
> Aaron

Attachment: event.log (688 bytes)