Re: pipeline freezes in running state whereas no job is running


Schobesberger Richard - S0910595006
Hi all,

I am experiencing similar problems.
The stdout/stderr view of the web service tells me that the event.log.monitoring file could not be created.
The workflow_monitor, however, had access to the various logs.

The first log was created by a job running on brain16. This job finished and workflow realised it.

I~~~prolog starting
I~~~htc id   sge id[.task id]   date   message   hostname
S~~~26~~~183~~~Wed Dec 22 14:10:26 CET 2010~~~job started on workflow.q for sgeadmin~~~brain16
I~~~prolog ending
I~~~ wrapper script starting job
I~~~ Job Process id is 4091
F~~~26~~~181~~~Wed Dec 22 14:10:32 CET 2010~~~command finished~~~0
I~~~epilog starting
I~~~htc id   sge id[.task id]   date   message   hostname
T~~~26~~~183~~~Wed Dec 22 14:10:32 CET 2010~~~job finished on workflow.q for sgeadmin~~~brain16
I~~~epilog ending

The second log was created by a similar job running at the same time on brain10 (apart from the input file, the two jobs were identical and ran in parallel).
This job finished, but workflow didn't realise it.

I~~~epilog starting
I~~~htc id   sge id[.task id]   date   message   hostname
T~~~27~~~184~~~Wed Dec 22 14:10:37 CET 2010~~~job finished on workflow.q for sgeadmin~~~brain10
I~~~epilog ending
I~~~prolog starting
I~~~htc id   sge id[.task id]   date   message   hostname
S~~~27~~~185~~~Wed Dec 22 14:10:41 CET 2010~~~job started on workflow.q for sgeadmin~~~brain06
I~~~prolog ending
I~~~ wrapper script starting job
I~~~ Job Process id is 19401
F~~~27~~~181~~~Wed Dec 22 14:28:45 CET 2010~~~command finished~~~0
I~~~epilog starting
I~~~htc id   sge id[.task id]   date   message   hostname
T~~~27~~~185~~~Wed Dec 22 14:28:45 CET 2010~~~job finished on workflow.q for sgeadmin~~~brain06
I~~~epilog ending

The jobs wrote to a MySQL database, and the data was committed only if the run completed correctly.
The data from both jobs was found in the database, so I conclude that the jobs themselves had no problem; workflow simply didn't realise they had finished.

What also bugs me is the event log of the second job. Since the second job ran on brain10 (workflow_monitor and SGE's qacct command agree on this), it seems there was an error in the event log creation. After noticing that brain06 was also used, I checked which job ran on that host: it turns out that after the first job on brain16 finished, the next job was started on brain06. This third job also finished, and since it is the last job in the iterator, it waited for the others to finish.

So the second and third jobs shared the same event log, which I assume should not happen. Furthermore, I wasn't able to find the brain10 prolog, or its "command finished" line, in any other event log.

Does anyone have an idea what's wrong here?

Richard


On 12/08/2010 04:16 PM, Mahurkar, Anup wrote:

>
> Gerald,
>
>  
>
> What version of Workflow are you running? Looking at the log messages, it appears there was a problem creating the event.log.monitoring file, which is how Workflow checks for job completion. If there is an occasional problem with this file, Workflow thinks the job never finished. Could you send me the event.log file for that particular command? If you look at the XML file for that command, it should contain the path to the event.log file.
>
>  
>
> From: Gérald Salin [mailto:[hidden email]]
> Sent: Wednesday, December 08, 2010 9:21 AM
> To: [hidden email]
> Subject: [Ergatis-users] pipeline freezes in running state whereas no job is running
>
>  
>
> Hi all,
> We observed some strange behaviour on our instance of Ergatis: a component that is shown in the running state in the pipeline view (http://genomique.genotoul.fr/tmp/ergatis_general.jpg) is in fact not running at all. Looking at the detail view (http://genomique.genotoul.fr/tmp/ergatis_detail.jpg), we can see that all the jobs have finished except the last one, which is incomplete... but none is running. I checked on our cluster: no SGE job corresponding to this pipeline is running. A step has finished, but the next one never begins (this component normally finishes in less than an hour).
> The pipeline.xml.log keeps growing (with no ERROR or FATAL entries in it). The pipeline.xml.run.out contains warnings about the event.log.monitoring file (see below).
>
> We can observe this behaviour for different pipelines, working on different data, at different steps
>
> Could that event.log.monitoring file be the reason for our problems?
> What could be causing this "freeze"?
> Workflow IDs are MySQL-based and pipeline IDs are file-based.
>
> thank you for your help
>
> Gérald
>
> pipeline.xml.run.out
> WARN 15:14:04:512 [Thread: (0) Monitor Command 6197392] SGERunner monitorForCompletion:1205 Failed creating file event.log.monitoring.
> WARN 15:14:14:953 [Thread: (0) Monitor Command 6197392] SGERunner monitorForCompletion:1205 Failed creating file event.log.monitoring.
> WARN 15:14:18:341 [Thread: (0) Monitor Command 6197392] SGERunner monitorForCompletion:1332 Could not delete the file event.log.monitoring.  This may halt the event log file monitoring. Delete it manually.
> WARN 15:14:31:477 [Thread: (0) Monitor Command 6197392] SGERunner monitorForCompletion:1205 Failed creating file event.log.monitoring.
> WARN 15:14:34:745 [Thread: (0) Monitor Command 6197392] SGERunner monitorForCompletion:1205 Failed creating file event.log.monitoring.
> WARN 15:15:04:534 [Thread: (0) Monitor Command 6197392] SGERunner monitorForCompletion:1205 Failed creating file event.log.monitoring.
> WARN 15:15:05:699 [Thread: (0) Monitor Command 6197392] SGERunner monitorForCompletion:1332 Could not delete the file event.log.monitoring.  This may halt the event log file monitoring. Delete it manually.
> WARN 15:15:12:181 [Thread: (0) Monitor Command 6197392] SGERunner monitorForCompletion:1205 Failed creating file event.log.monitoring.
> WARN 15:15:23:797 [Thread: (0) Monitor Command 6197392] SGERunner monitorForCompletion:1332 Could not delete the file event.log.monitoring.  This may halt the event log file monitoring. Delete it manually.
> WARN 15:15:33:683 [Thread: (0) Monitor Command 6197392] SGERunner monitorForCompletion:1205 Failed creating file event.log.monitoring.
> WARN 15:15:53:349 [Thread: (0) Monitor Command 6197392] SGERunner monitorForCompletion:1205 Failed creating file event.log.monitoring.
> WARN 15:16:02:300 [Thread: (0) Monitor Command 6197392] SGERunner monitorForCompletion:1205 Failed creating file event.log.monitoring.
> WARN 15:16:38:273 [Thread: (0) Monitor Command 6197392] SGERunner monitorForCompletion:1332 Could not delete the file event.log.monitoring.  This may halt the event log file monitoring. Delete it manually.
> WARN 15:16:49:322 [Thread: (0) Monitor Command 6197392] SGERunner monitorForCompletion:1205 Failed creating file event.log.monitoring.
>
>
> --
> Gérald Salin
> Informatique - Plateforme Génomique
> Génopole Toulouse Midi-Pyrénées
> Tél : 05.61.28.55.90
> Fax : 05.61.28.55.93
> web : http://genomique.genotoul.fr 


------------------------------------------------------------------------------
Forrester recently released a report on the Return on Investment (ROI) of
Google Apps. They found a 300% ROI, 38%-56% cost savings, and break-even
within 7 months.  Over 3 million businesses have gone Google with Google Apps:
an online email calendar, and document program that's accessible from your
browser. Read the Forrester report: http://p.sf.net/sfu/googleapps-sfnew
_______________________________________________
Ergatis-users mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/ergatis-users

Joshua Orvis
Richard -

Our primary Workflow Engine guy, Anup, has been out on vacation for the last few weeks.  I'm forwarding this to him to see if he (or someone in his group) can help you out.

Joshua


On Wed, Dec 22, 2010 at 8:50 AM, Schobesberger Richard - S0910595006 <[hidden email]> wrote:
[...]

Schobesberger Richard - S0910595006
Hi all,

I've narrowed the problem down to Workflow's ID distribution. I'm using Database ID Generation with the get_next_id procedure specified on the Ergatis MediaWiki.

   CREATE PROCEDURE get_next_id (IN count INT, OUT param1 NUMERIC)
   BEGIN
     UPDATE sequence SET id=LAST_INSERT_ID(id+1);
     SELECT LAST_INSERT_ID() INTO param1;
   END

I realised that if two similar programmes run in parallel, the ID difference between them is too small.
That is, if programme 1 gets ID 15, programme 2 gets 16. But subprogramme 1.1 then also gets 16, and subprogramme 2.1 gets 17. The next subprogramme, 1.2, also gets 17, and apparently, if subprogramme 2.1 has not finished yet, subprogrammes 2.1 and 1.2 write into the same log file.

So I looked at the procedure but couldn't work out what the parameter 'count' does. I changed LAST_INSERT_ID(id+1) to LAST_INSERT_ID(id+count), and the ID problem was gone.
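For what it's worth, the effect of that one-line change can be sketched with a toy simulation (plain Python with invented names; the real allocation happens in the MySQL procedure and in Workflow's ID cache, and the exact batch semantics here are my assumption):

```python
# Toy model of the ID allocation. Each call to get_next_id advances a
# shared counter and (in this sketch) reserves the `count` values that
# end at the returned ID; reserve() models one such round trip.

def reserve(state, count, stride):
    """Advance the shared counter by `stride` and treat the last
    `count` values as this caller's cached batch of IDs."""
    state["id"] += stride
    last = state["id"]
    return list(range(last - count + 1, last + 1))

# Buggy wiki procedure: LAST_INSERT_ID(id + 1), so the stride is
# always 1 no matter how many IDs the caller wanted.
state = {"id": 14}
a = reserve(state, count=3, stride=1)   # pipeline 1's batch
b = reserve(state, count=3, stride=1)   # pipeline 2's batch
print(a, b)   # [13, 14, 15] [14, 15, 16] -- the batches overlap

# Fixed procedure: LAST_INSERT_ID(id + count), stride equals count.
state = {"id": 14}
a = reserve(state, count=3, stride=3)
b = reserve(state, count=3, stride=3)
print(a, b)   # [15, 16, 17] [18, 19, 20] -- disjoint batches
```

With a stride of 1, two parallel pipelines reserve overlapping ID ranges, which would explain two jobs ending up with the same ID and the same event log.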

I'm still not sure what count actually does, since I don't have access to the source code, but there was probably a mistake in the MediaWiki entry for the procedure.

Hope that helps!

Richard


________________________________________
From: Schobesberger Richard - S0910595006
Sent: Wednesday, 22 December 2010 15:50
To: [hidden email]
Subject: Re: [Ergatis-users] pipeline freezes in running state whereas no job is running

[...]

Mahurkar, Anup
Richard,

You stumbled upon the right answer. Thanks for catching a problem in the stored procedure. Workflow has a mechanism to cache IDs locally so that we do not hit the server every time we need an ID. The count parameter sets the batch size of this cache, and the Workflow config file has a parameter that specifies this batch size. We will update the documentation accordingly.
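A cache of the kind described here might look roughly like this (a minimal Python sketch with invented names; the real engine is Java, and the batch size actually comes from the Workflow config file):

```python
class IdCache:
    """Minimal local ID cache: hand out IDs one at a time, refilling
    from the server in batches so we don't hit the database on every
    request. `get_next_id` stands in for the stored-procedure call."""

    def __init__(self, get_next_id, batch_size):
        self.get_next_id = get_next_id
        self.batch_size = batch_size
        self.next_id = 0
        self.remaining = 0

    def next(self):
        if self.remaining == 0:
            # One round trip reserves a whole batch. With the corrected
            # id+count procedure, the range ending at `last` is ours alone.
            last = self.get_next_id(self.batch_size)
            self.next_id = last - self.batch_size + 1
            self.remaining = self.batch_size
        self.remaining -= 1
        value = self.next_id
        self.next_id += 1
        return value

# Fake server implementing the corrected id + count behaviour.
counter = {"id": 0}
def fake_get_next_id(count):
    counter["id"] += count
    return counter["id"]

cache = IdCache(fake_get_next_id, batch_size=5)
print([cache.next() for _ in range(7)])   # [1, 2, 3, 4, 5, 6, 7]
```

Note how only the second batch (IDs 6-10) triggers a second server round trip; if the server advanced by 1 instead of `count`, two such caches would overlap exactly as Richard observed.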

Regards,
Anup

On 1/12/11 3:58 AM, "Schobesberger Richard - S0910595006"
<[hidden email]> wrote:

>[...]