Need help with ADE interface to LSF for simulation job distribution...

Vivek N (Guest)
Hi

We are trying to write a customized DRMS job interface for our private cluster, and we are having issues getting it to work.

We have followed the template in the SKILL file, but there are some unexpected results.

For example, we see it launch the interface job, which then times out with:
"Querying health for interface job 2, current health: unknown, new health: unknown"
This repeats a few times until the job policy timeout expires.

After that, our interface is called and the job status etc. are OK, so it looks like we have something wrong.

Has anyone successfully got this working?
If so, I will describe the issues in more detail.

Thanks in advance
V
 
I wrote a simple interface for Spectre jobs to a non-public DRMS that Sun/Oracle used, about 8 years ago.
If you provide more detail on what you are doing, the template SKILL file, etc., I may be able to help.
 
Thanks for your response!

I will get back to you with details soon...
 
So, here is what is happening (we wanted to eliminate all external issues before we got back to you).

We have defined an axlJobIntfc custom subclass which does the following:
On submitjob, we call a script with system() that launches the job on our remote cluster.
getjobstatus gets called every 5 seconds; we query the status with another script, but only every 30 seconds.
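
Roughly, in SKILL the pieces look like this (a trimmed sketch: ourJobIntfc, our_submit.sh, and our_status.sh are placeholder names, and the real method argument lists are the ones from the template SKILL file):

    ;; Trimmed sketch with placeholder names throughout.
    defclass( ourJobIntfc (axlJobIntfc) () )

    defmethod( submitjob ((job ourJobIntfc) command)
        ;; system() returns as soon as the script has enqueued the job on
        ;; the remote cluster; the script does not wait for it to finish.
        system( sprintf(nil "our_submit.sh %s" command) )
    )

    ;; getjobstatus is polled every 5 seconds, but our remote query is
    ;; slow, so we only run it on every 6th call (~30 s) and reuse the
    ;; cached status in between.
    ourPollCount = 0
    ourLastStatus = "SUBMITTED"

    defmethod( getjobstatus ((job ourJobIntfc))
        ourPollCount = ourPollCount + 1
        when( zerop( modulo(ourPollCount 6) )
            ;; our_status.sh prints one of SUBMITTED/PROCESSING/COMPLETED
            let( ((out ipcReadProcess( ipcBeginProcess("our_status.sh") 5 )))
                when( out
                    ourLastStatus = car( parseString(out) )
                )
            )
        )
        ourLastStatus
    )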

axlJobIntfcHealthMethod has been changed to translate our cluster manager's job status to "alive" or "dead" as follows:
SUBMITTED (until our scheduler starts the job): alive
PROCESSING (job is running): alive
COMPLETED: dead
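
In code the translation is just this (sketch; the real axlJobIntfcHealthMethod signature is whatever the template defines, we only call a helper like this from it):

    ;; Map our cluster manager's state onto the "alive"/"dead" strings
    ;; that the health method reports back to ADE.
    procedure( ourTranslateStatus(clusterStatus)
        cond(
            ( equal(clusterStatus "SUBMITTED")  "alive"   )  ; queued, not started yet
            ( equal(clusterStatus "PROCESSING") "alive"   )  ; running on the cluster
            ( equal(clusterStatus "COMPLETED")  "dead"    )  ; finished, results written
            ( t                                 "unknown" )  ; anything unexpected
        )
    )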

We are able to launch this configuration from Cadence Virtuoso on a test solve.
The job gets scheduled on our cluster and runs.
The log shows the correct statuses: the job goes from SUBMITTED to PROCESSING to COMPLETED, and the remote job finishes.

However, once it completes, Cadence does not seem to accept that the solve completed: it relaunches the job again and again, three times in total, and then gives up with a warning dialog.
Note that the remote job writes its output files to a commonly mounted network share, so the solver's output files are seen as if they were local.

We are not able to understand what we need to do to tell Cadence that the solve completed. Our assumption was that setting the "dead" health and the presence of the solver output files should have done the trick...
What are we missing?
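
For reference, the completion condition we assumed would be enough looks like this (sketch only; ourResultFile stands in for the actual solver output path on the share):

    ;; What we assumed signals completion to ADE: health reports "dead"
    ;; and the solver output is already visible on the shared mount.
    procedure( ourJobLooksFinished(clusterStatus ourResultFile)
        and(
            equal( ourTranslateStatus(clusterStatus) "dead" )  ; cluster says COMPLETED
            isFile( ourResultFile )                            ; output exists locally
        )
    )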

Thanks in advance
 
