Workflow not detecting Aborted SGE jobs?

Workflow not detecting Aborted SGE jobs?

Aaron Gussman
Has anyone encountered an issue with Workflow not detecting that an
SGE job was Aborted?  It looks like the Workflow instance(s) just keep
running and never detect that the job(s) are gone or update the
pipeline.xml.  I have to go in and kill them manually on the Ergatis
master.

I think other SGE-job-removal situations might cause the same issue,
but Aborted jobs are where I've consistently noticed the problem.

Is this a known issue with Workflow, or do I maybe have something
configured incorrectly?

Thanks,
Aaron

------------------------------------------------------------------------------
Nokia and AT&T present the 2010 Calling All Innovators-North America contest
Create new apps & games for the Nokia N8 for consumers in  U.S. and Canada
$10 million total in prizes - $4M cash, 500 devices, nearly $6M in marketing
Develop with Nokia Qt SDK, Web Runtime, or Java and Publish to Ovi Store
http://p.sf.net/sfu/nokia-dev2dev
_______________________________________________
Ergatis-users mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/ergatis-users

Re: Workflow not detecting Aborted SGE jobs?

Joshua Orvis
When this happens, what is written in the prolog output for that job?

JO

Re: Workflow not detecting Aborted SGE jobs?

Mahurkar, Anup
In reply to this post by Aaron Gussman
Aaron,

If I recall correctly, this issue has been fixed in one of the newer
versions of Workflow. But to make sure we are talking about the same issue,
could you send me the event.log?

There was a time when we expected the finish line to start with 'F', and
if we did not detect it we kept on waiting. At some point I fixed it so that I
also look for a 'T' line, which is written by the prolog, and if I see it
without an 'F' line I mark that step as an error. So looking at the event.log
file will give me the clue.
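
For readers following along, the check described above can be sketched roughly as follows. This is only an illustration, not Workflow's actual code: the function name and the three state labels are made up here, and the `~~~` field separator is assumed from the event.log posted later in this thread.

```python
def classify_event_log(lines):
    """Guess a job's state from the record-type letter on each event.log line.

    Hypothetical helper, not part of Workflow. Assumes each record starts
    with a one-letter type followed by the '~~~' separator.
    """
    saw_finish = False      # an 'F' record was seen
    saw_terminate = False   # a 'T' record was seen
    for line in lines:
        record_type = line.split("~~~", 1)[0]
        if record_type == "F":
            saw_finish = True
        elif record_type == "T":
            saw_terminate = True
    if saw_finish:
        return "finished"
    if saw_terminate:
        # 'T' without 'F': the job is gone but never reported a finish,
        # so the step is flagged as an error instead of being waited on.
        return "error"
    return "running"
```

With only the old 'F' check, a log that has a 'T' record but no 'F' record would stay in the "running" state forever, which matches the hang described in the original post.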


Re: Workflow not detecting Aborted SGE jobs?

Aaron Gussman
Hi Anup,

Looks like it's missing the 'F' line, but the 'T' line is there, written by the epilog.  Here you go:

[pmadm@ebuild02 110748]$ cat event.log
I~~~prolog starting
I~~~htc id   sge id[.task id]   date   message   hostname
S~~~110748~~~3271144~~~Wed Oct 20 11:18:47 EDT 2010~~~job started on workflow for pmadm~~~sge411.be-md.ncbi.nlm.nih.gov
I~~~prolog ending
I~~~ wrapper script starting job
I~~~ Job Process id is 25255
I~~~epilog starting
I~~~htc id   sge id[.task id]   date   message   hostname
T~~~110748~~~3271144~~~Wed Oct 20 11:18:52 EDT 2010~~~job finished on workflow for pmadm~~~sge411.be-md.ncbi.nlm.nih.gov
I~~~epilog ending
[pmadm@ebuild02 110748]$ cat sge_submit.out
Your job 3271144 ("RunWorkflow") has been submitted
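
In case it helps anyone reading a log like this, here is a small sketch that splits one record into named fields. The field names are taken from the "htc id   sge id[.task id]   date   message   hostname" header line in the log above; the `JobEvent` type and `parse_job_event` function are hypothetical names, not part of Workflow. It assumes each record sits on a single line, so a record that a mail client wrapped has to be rejoined first.

```python
from typing import NamedTuple, Optional


class JobEvent(NamedTuple):
    """One S/T-style record from event.log (names follow the log's header line)."""
    record_type: str
    htc_id: str
    sge_id: str
    date: str
    message: str
    hostname: str


def parse_job_event(line: str) -> Optional[JobEvent]:
    """Split a '~~~'-separated record; return None for lines that do not have
    all six fields (for example the informational 'I' lines)."""
    fields = line.rstrip("\n").split("~~~")
    if len(fields) != 6:
        return None
    return JobEvent(*fields)
```

For the 'T' record above, this would give htc_id "110748", sge_id "3271144" and the message "job finished on workflow for pmadm".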



Re: Workflow not detecting Aborted SGE jobs?

Mahurkar, Anup
As I suspected, I think the binary probably segfaulted, so the wrapper script did not write the 'F' line. This has been fixed in versions later than 3.1.2, so we should work on upgrading you. I will chat with Victor about posting the new stored procedure so you can get version 3.1.4 working.

Re: Workflow not detecting Aborted SGE jobs?

Aaron Gussman
Thanks Anup, I appreciate the help.
-Aaron
