Development Artist
[LSF] Troubleshooting Collection
Problem 1:
mbatchd dies after the primary administrator is changed.
The daemon mbatchd died with the following errors:
Nov 22 19:42:51 2010 31631 3 7.06 init_log: Log directory not owned by LSF administrator (owner ID is 1527)
Nov 22 19:42:51 2010 31631 3 7.06 getElock: Last owner of lock file was on this host with pid <31073>; attempting to take over lock file
Nov 22 19:42:51 2010 31631 3 7.06 Master daemon on host dying; fatal error - see above messages for reason
Solution:
1. Change the primary administrator back to the original administrator you set during installation.
Modify the configuration file lsf.cluster.cluster_name in $LSF_ENVDIR:
Begin ClusterAdmins
Administrators = lsfadmin
End ClusterAdmins
2. Change the ownership of the file $LSF_TOP/work/cluster_name/logdir/lsb.events back to the original administrator set during installation:
chown lsfadmin lsb.events
3. Kill the sbatchd daemon on the management host.
4. Run the command "badmin hstartup" to start sbatchd again.
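Before and after applying the steps above, it can help to confirm who actually owns the log directory. The following is only a minimal sketch: the function names are made up for illustration, and it assumes that init_log reports a numeric UID (as in the "owner ID is 1527" message) and that ls -ldn prints the numeric owner in its third column.

```shell
# Extract the numeric owner ID reported by init_log (reads log text on stdin).
# Hypothetical helper; the pattern is taken from the log line shown above.
reported_owner_id() {
  sed -n 's/.*Log directory not owned by LSF administrator (owner ID is \([0-9][0-9]*\)).*/\1/p'
}

# Print the numeric UID that owns a file or directory.
owner_uid() {
  ls -ldn -- "$1" | awk '{print $3}'
}

# Succeed only when the path is owned by the given user name (for example lsfadmin).
owned_by_user() {
  [ "$(owner_uid "$1")" = "$(id -u "$2")" ]
}
```

For example, owned_by_user "$LSF_TOP/work/cluster_name/logdir" lsfadmin should succeed once the ownership is back in place, and owner_uid on the same directory can be compared with the ID that mbatchd logged.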
Problem 2:
The sbatchd log shows "Daemon on host received signal <15>; exiting" when you run /etc/init.d/lsf
You use /etc/init.d/lsf to restart the LSF daemons and want to know what the following messages in sbatchd.log and lim.log mean.
sbatchd.log.host_name
Jun 2 10:42:26 2011 30636 3 7.06 Daemon on host received signal <15>; exiting
Jun 2 11:04:52 2011 30818 3 7.06 Daemon on host received signal <15>; exiting
lim.log.host_name
Jun 2 11:04:52 2011 30814 3 1.2.3 term_handler: Received signal 15, exiting
Jun 2 11:19:45 2011 4767 3 1.2.3 term_handler: Received signal 15, exiting
Solution:
This is the default, harmless behavior that occurs when you use /etc/init.d/lsf to restart the LSF daemons. /etc/init.d/lsf invokes kill_daemons to kill the LSF daemons, and the messages above are then logged in the sbatchd.log and lim.log files.
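To check that these entries are ordinary termination requests, the sketch below looks up the name of signal 15 and counts the matching log lines. It is an illustrative helper only: the function names are assumptions, and the patterns are copied from the two message formats quoted above.

```shell
# Print the name of a signal number; signal 15 is TERM, the normal
# "please exit" request, as opposed to a crash.
signal_name() {
  kill -l "$1"
}

# Count the signal-15 exit messages in the given log files (or on stdin).
count_sigterm_exits() {
  grep -c -E 'received signal <15>|Received signal 15' "$@"
}
```

Running count_sigterm_exits sbatchd.log.host_name lim.log.host_name and comparing the timestamps with the times you ran /etc/init.d/lsf is usually enough to confirm that the messages come from the restart.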
Problem 3:
Error message: Received request <5> from non-LSF host.
Solution:
The master LIM log file shows the error message: Received request <5> from non-LSF host.
Usually, when the LIM log records this message, the line just above it gives the IP address of the host. Use that IP address to identify the host, then check the following:
- Was the "non-LSF host" previously a server in your cluster? If yes, log on to the host and kill the related LSF daemons.
- Does the host have multiple IP addresses, or has its IP address changed? If yes, create a hosts file under $LSF_ENVDIR with entries in "xxx.xxx.xxx.xxx hostname" format so that the host has a unique official name.
- Is the host a floating client host? If yes, the message may be caused by a busy master LIM. Remove the master host name from the LSF_SERVER_HOSTS list in the lsf.conf file to offload the master LIM.
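Finding the IP address on the line above each message can be tedious in a long log. The sketch below is one possible way to collect those addresses; the function name is made up, and it assumes only that the preceding line contains a dotted IPv4 address somewhere, since the exact wording of that line is not shown here.

```shell
# For each "from non-LSF host" message, print the first IPv4-looking
# address found on the line immediately before it (unique, sorted).
# Reads the named log files, or stdin when no file is given.
non_lsf_host_ips() {
  awk '
    /from non-LSF host/ {
      if (match(prev, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/))
        print substr(prev, RSTART, RLENGTH)
    }
    { prev = $0 }
  ' "$@" | sort -u
}
```

For example, non_lsf_host_ips lim.log.host_name prints one address per offending host, which you can then resolve to a host name and check against the three questions above.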
Problem 4:
Why is my cluster down and LSF daemon not responding?
Our whole cluster is down and we don't know how to proceed.
Facts
All commands come back with the message:
LSF daemon (LIM) not responding ... still trying
We ran /etc/init.d/lsf stop and start, but the results are the same.
Running ps -ef | grep lsf on the master shows:
- root 6080 1 0 11:10:32 ? 0:00 /tools/platform-lsf-6.2/6.2/sparc-sol7-64/etc/res
- root 6082 1 0 11:10:32 ? 0:00 /tools/platform-lsf-6.2/6.2/sparc-sol7-64/etc/sbatchd
- root 6078 1 0 11:10:32 ? 0:00 /tools/platform-lsf-6.2/6.2/sparc-sol7-64/etc/lim
- root 6103 6078 0 11:14:50 ? 0:00 /tools/platform-lsf-6.2/6.2/sparc-sol7-64/etc/pim
The log file from the master after the start/stop:
- Mar 28 11:10:23 2007 5952 3 6.2 term_handler: Received signal 15, exiting
- Mar 28 11:10:38 2007 6078 4 6.2 initNewMaster: I am the master now.
- Mar 28 11:12:22 2007 6078 3 6.2 licenses are currently used by root@lsf01-tx30(8.0), root@lsf01-tx32(2362.0),
- Mar 28 11:12:51 2007 6078 3 6.2 licenses are currently used by root@tiros(109.0), root@vs03-zmx24(59.0), root@cde-tx30-046g(250.0),
- Mar 28 11:13:24 2007 6078 3 6.2 getLicense(1): feature
License server does not support this feature
Feature: lsf_make
License path: 1800@swim-m
FLEXlm error: -18,147
- Mar 28 11:13:24 2007 6078 3 6.2 Your license server(s) may not be referring to the same license file as your master lim. Lim will re-initialize with the current license file. If you haven't done so already, please shutdown and startup lmgrd to re-initialize the license server(s).
The problem is indicated by this message:
Mar 28 11:13:34 2007 6078 3 6.2 Your license server(s) may not be referring to the same license file as your master lim. Lim will re-initialize with the current license file. If you haven't done so already, please shutdown and startup lmgrd to re-initialize the license server(s).
Solution:
Stop LSF, stop lmgrd, then start lmgrd and start LSF again, in that order.
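To see at a glance which license feature the master LIM failed to check out, the FLEXlm details can be pulled from the LIM log. This is a rough sketch with a hypothetical function name; it assumes the "Feature:", "License path:" and "FLEXlm error:" lines appear in that order, as in the log excerpt above.

```shell
# Print one line per FLEXlm failure: feature, license path, error code.
# Reads the named log files, or stdin when no file is given.
flexlm_failures() {
  awk '
    $1 == "Feature:"                   { feature = $2 }
    $1 == "License" && $2 == "path:"   { path = $3 }
    $1 == "FLEXlm"  && $2 == "error:"  { print feature, path, $3 }
  ' "$@"
}
```

With the excerpt above as input, this reports the lsf_make feature against the license path 1800@swim-m. If a failure is still reported after lmgrd and LSF have been restarted, the license file itself probably needs to be checked.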