Development Artist


[LSF] Troubleshooting Collection

JMcunst 2022. 6. 3. 11:22


Problem 1:

mbatchd dies after primary administrator is changed.

 

The daemon mbatchd died with the following errors:

Nov 22 19:42:51 2010 31631 3 7.06 init_log: Log directory not owned by LSF administrator (owner ID is 1527)
Nov 22 19:42:51 2010 31631 3 7.06 getElock: Last owner of lock file was on this host with pid <31073>; attempting to take over lock file
Nov 22 19:42:51 2010 31631 3 7.06 Master daemon on host dying; fatal error - see above messages for reason

Solution:

 1. Change the primary administrator back to the original administrator you set during installation. To do this, edit the ClusterAdmins section of the configuration file lsf.cluster.cluster_name in $LSF_ENVDIR:

Begin ClusterAdmins
Administrators = lsfadmin
End ClusterAdmins

 2. Change the owner of the file $LSF_TOP/work/cluster_name/logdir/lsb.events back to the original administrator set during installation:

chown lsfadmin lsb.events

 3. Kill the sbatchd daemon on the management host.
 4. Run the command "badmin hstartup" to start sbatchd again.
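
The ownership mismatch reported by init_log ("owner ID is 1527") can be confirmed before and after the chown with a small check. The following is a minimal sketch, not part of the original procedure: the helper names owner_uid and check_owner are hypothetical, and the lsfadmin account and paths in the commented example are taken from the steps above and will differ per site.

```shell
#!/bin/sh
# Print the numeric owner UID of a file or directory.
owner_uid() {
    ls -ldn "$1" | awk '{print $3}'
}

# Return 0 if path $1 is owned by UID $2; otherwise report and return non-zero.
check_owner() {
    if [ ! -e "$1" ]; then
        echo "MISSING: $1"
        return 2
    fi
    actual=$(owner_uid "$1")
    if [ "$actual" = "$2" ]; then
        echo "OK: $1 is owned by UID $2"
    else
        echo "MISMATCH: $1 is owned by UID $actual, expected $2"
        return 1
    fi
}

# Example (commented out; the account name and paths are site-specific):
# admin_uid=$(id -u lsfadmin)
# check_owner "$LSF_TOP/work/cluster_name/logdir" "$admin_uid"
# check_owner "$LSF_TOP/work/cluster_name/logdir/lsb.events" "$admin_uid"
```

Comparing numeric UIDs instead of user names matches what the log message reports and avoids surprises when a name resolves differently on different hosts.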


Problem 2:

The sbatchd log shows "Daemon on host received signal <15>; exiting" when you run /etc/init.d/lsf

You use /etc/init.d/lsf to restart the LSF daemons and want to know what the following messages in sbatchd.log and lim.log mean.

sbatchd.log.host_name

Jun 2 10:42:26 2011 30636 3 7.06 Daemon on host received signal <15>; exiting
Jun 2 11:04:52 2011 30818 3 7.06 Daemon on host received signal <15>; exiting

lim.log.host_name

Jun 2 11:04:52 2011 30814 3 1.2.3 term_handler: Received signal 15, exiting
Jun 2 11:19:45 2011 4767 3 1.2.3 term_handler: Received signal 15, exiting

Solution:

This is the default, harmless behavior that occurs when you use /etc/init.d/lsf to restart the LSF daemons. /etc/init.d/lsf invokes kill_daemons to kill the LSF daemons, and the messages above are logged in the sbatchd.log and lim.log files as the daemons exit.
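
Signal 15 is SIGTERM, the ordinary termination request, which is why these entries are expected after a restart. As a rough sketch, the matching entries can be listed and compared against the times of known restarts. The function name sigterm_exits and the path in the commented example are hypothetical.

```shell
#!/bin/sh
# Show the name of signal 15 (prints TERM).
kill -l 15

# Print the lines of a log file ($1) that record a daemon exiting on signal 15.
# The pattern covers both the sbatchd form ("signal <15>; exiting")
# and the lim form ("signal 15, exiting") shown above.
sigterm_exits() {
    grep -E '[Rr]eceived signal <?15>?[;,] exiting' "$1"
}

# Example (commented out; the log location is site-specific):
# sigterm_exits /path/to/log/sbatchd.log.host_name
```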


Problem 3:

Error message: Received request <5> from non-LSF host.

 

Solution:

The master lim log file shows the error message: Received request <5> from non-LSF host.

Usually, when the lim log file records this message, the line above it gives the IP address of the host. Use that IP address to identify the host, then check the following:

  • Was the "non-LSF host" previously a server in your cluster? If yes, log on to the host and kill the related LSF daemons.
  • Does the host have multiple IP addresses, or has the host's IP address changed? If yes, create a hosts file under $LSF_ENVDIR with entries in the format "xxx.xxx.xxx.xxx hostname" so that the host has a unique official name.
  • Is the host a floating client host? If yes, the message may be caused by a busy master lim. Remove the master host name from the LSF_SERVER_HOSTS list in the lsf.conf file to offload the master lim.
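
To find the line that names the offending IP address, it can help to print each occurrence of the message together with the line just before it. This is a minimal sketch that relies only on the message text quoted above. The function name non_lsf_context and the log path in the commented example are hypothetical.

```shell
#!/bin/sh
# For every "from non-LSF host" entry in log file $1,
# print the preceding line (if any) followed by the entry itself.
non_lsf_context() {
    awk '
        /from non-LSF host/ { if (NR > 1) print prev; print }
        { prev = $0 }
    ' "$1"
}

# Example (commented out; the log location is site-specific):
# non_lsf_context /path/to/log/lim.log.host_name
```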

Problem 4:

Why is my cluster down and LSF daemon not responding?

 

Our whole cluster is down and we don't know how to proceed.

Facts:
All commands fail with the following message:
LSF daemon (LIM) not responding ... still trying

We have run /etc/init.d/lsf stop and start, but the results are the same.

After running the ps -ef | grep lsf command on the master, the result is shown below:

  • root 6080 1 0 11:10:32 ? 0:00 /tools/platform-lsf-6.2/6.2/sparc-sol7-64/etc/res
  • root 6082 1 0 11:10:32 ? 0:00 /tools/platform-lsf-6.2/6.2/sparc-sol7-64/etc/sbatchd
  • root 6078 1 0 11:10:32 ? 0:00 /tools/platform-lsf-6.2/6.2/sparc-sol7-64/etc/lim
  • root 6103 6078 0 11:14:50 ? 0:00 /tools/platform-lsf-6.2/6.2/sparc-sol7-64/etc/pim

The log file from the master after the start/stop:

  • Mar 28 11:10:23 2007 5952 3 6.2 term_handler: Received signal 15, exiting
  • Mar 28 11:10:38 2007 6078 4 6.2 initNewMaster: I am the master now.
  • Mar 28 11:12:22 2007 6078 3 6.2 licenses are currently used by root@lsf01-tx30(8.0), root@lsf01-tx32(2362.0),
  • Mar 28 11:12:51 2007 6078 3 6.2 licenses are currently used by root@tiros(109.0), root@vs03-zmx24(59.0), root@cde-tx30-046g(250.0),
  • Mar 28 11:13:24 2007 6078 3 6.2 getLicense(1): feature
    License server does not support this feature
    Feature:       lsf_make
    License path:  1800@swim-m
    FLEXlm error:  -18,147
  • Mar 28 11:13:24 2007 6078 3 6.2 Your license server(s) may not be
    referring to the same license file as your master lim.  Lim will
    re-initialize with the current license file. If you haven't done so
    already, please shutdown and startup lmgrd to re-initialize the license
    server(s).

The problem comes from this message:

Mar 28 11:13:34 2007 6078 3 6.2 Your license server(s) may not be referring to the same license file as your master lim. Lim will re-initialize with the current license file. If you haven't done so already, please shutdown and startup lmgrd to re-initialize the license server(s).

Solution:

Stop LSF, then stop lmgrd. After that, start lmgrd first and then start LSF again.
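
The order of operations can be sketched as a dry-run script. The original gives no exact commands for lmgrd, so everything here is an assumption: "lmutil lmdown -c ... -q" and "lmgrd -c ..." follow the usual FLEXlm command-line tools and may differ in your installation, the helper names run and restart_with_license are hypothetical, and the license file path must be supplied by you. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default); run it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop LSF, stop lmgrd, start lmgrd, start LSF, in that order.
# $1 is the license file used by both lmgrd and the master lim (site-specific).
restart_with_license() {
    run /etc/init.d/lsf stop
    run lmutil lmdown -c "$1" -q   # assumed FLEXlm shutdown command
    run lmgrd -c "$1"              # assumed FLEXlm startup command
    run /etc/init.d/lsf start
}

# Example (commented out; the license file path is a placeholder):
# restart_with_license /path/to/license.dat
```

Starting lmgrd before LSF matters here because the master lim checks out its licenses at startup and must see the same license file as the license server.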

 
