
# LSF error messages

The following error messages are logged by the LSF daemons or displayed by the **lsadmin ckconfig** and **badmin ckconfig** commands.

## General errors

The following messages can be generated by any LSF daemon:

* can’t open file: error

  The daemon could not open the named file, for the reason given by error. This error is usually caused by incorrect file permissions or missing files. All directories in the path to the configuration files must have execute (x) permission for the LSF administrator, and the files themselves must have read (r) permission.

  Missing files might be caused by the following errors:

  * Incorrect path names in the lsf.conf file
  * Running LSF daemons on a host where the configuration files are not installed
  * Having a symbolic link that points to a file or directory that does not exist
* file(line): malloc failed

  Memory allocation failed. Either the host does not have enough available memory or swap space, or there is an internal error in the daemon. Check the program load and available swap space on the host. If the swap space is full, you must add more swap space or run fewer (or smaller) programs on that host.
* auth\_user: getservbyname(ident/tcp) failed: error; ident must be registered in services

  The **LSF\_AUTH=ident** parameter is defined in the lsf.conf file, but the ident/tcp service is not defined in the services database. Add ident/tcp to the services database, or remove the **LSF\_AUTH=ident** parameter from the lsf.conf file and make the LSF binary files that require authentication **setuid root**.
* auth\_user: operation(/) failed: error

  The **LSF\_AUTH=ident** parameter is defined in the lsf.conf file, but the LSF daemon failed to contact the **identd** daemon on the host. Check that **identd** is defined in inetd.conf and that the **identd** daemon is running on the host.
* auth\_user: Authentication data format error (rbuf=) from /

  auth\_user: Authentication port mismatch (...) from /

  The **LSF\_AUTH=ident** parameter is defined in the lsf.conf file, but there is a protocol error between LSF and the **ident** daemon on host. Make sure that the **identd** daemon on the host is configured correctly.
* userok: Request from bad port (), denied

  The **LSF\_AUTH=ident** parameter is not defined, and the LSF daemon received a request that originates from a non-privileged port. The request is not serviced.

  Set the LSF binary files to be owned by root with the **setuid** bit set, or define the **LSF\_AUTH=ident** parameter and set up an **ident** server on all hosts in the cluster. If the files are on an NFS-mounted file system, make sure that the file system is not mounted with the **nosuid** flag.
* userok: Forged username suspected from /: /

  The service request claimed to come from user claimed\_user but ident authentication returned that the user was actual\_user. The request was not serviced.
* userok: ruserok(\<host>,\<uid>) failed

  The **LSF\_USE\_HOSTEQUIV** parameter is defined in the lsf.conf file, but host is not set up as an equivalent host in /etc/hosts.equiv, and user uid is not set up in a .rhosts file.
* init\_AcceptSock: RES service(res) not registered, exiting

  init\_AcceptSock: res/tcp: unknown service, exiting

  initSock: LIM service not registered.

  initSock: Service lim/udp is unknown. Read LSF Guide for help

  get\_ports:  service not registered

  The LSF services are not registered.
* init\_AcceptSock: Can’t bind daemon socket to port : error, exiting

  init\_ServSock: Could not bind socket to port : error

  These error messages can occur if you try to start a second LSF daemon (for example, RES is already running, and you run RES again). If so, and you want to start the new daemon, kill the running daemon or use the **lsadmin** or **badmin** commands to shut down or restart the daemon.
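
The permission and missing-file causes listed for the `can't open file` message can be checked mechanically. The following is a minimal sketch, not part of LSF: the function name `diagnose_unreadable` is hypothetical, and it tests access for the user who runs it, so it is assumed to be run as the LSF administrator.

```python
import os
from pathlib import Path


def diagnose_unreadable(path):
    """Return a list of problems that would stop the current user from
    opening `path` for reading (an empty list means none were found)."""
    problems = []
    target = Path(os.path.abspath(path))
    # Walk from the root down so the first failing directory is reported first.
    for directory in reversed(target.parents):
        if not directory.exists():
            problems.append(f"missing directory: {directory}")
            return problems
        if not os.access(directory, os.X_OK):
            problems.append(f"no execute (x) permission on: {directory}")
    if target.is_symlink() and not target.exists():
        # The link itself is present but points at nothing.
        problems.append(f"dangling symbolic link: {target}")
    elif not target.exists():
        problems.append(f"missing file: {target}")
    elif not os.access(target, os.R_OK):
        problems.append(f"no read (r) permission on: {target}")
    return problems
```

Calling `diagnose_unreadable()` with the path of a configuration file returns an empty list when every directory is searchable and the file is readable. It does not check whether the path names in lsf.conf are the intended ones.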

## Configuration errors

The following messages are caused by problems in the LSF configuration files. General errors are listed first, and then errors from specific files.

* file(line): Section name expected after Begin; ignoring section

  file(line): Invalid section name name; ignoring section

  The keyword Begin at the specified line is not followed by a section name, or is followed by an unrecognized section name.
* file(line): section section: Premature EOF

  The end of file was reached before reading the End section line for the named section.
* file(line): keyword line format error for section section; Ignore this section

  The first line of the section must contain a list of keywords. This error is logged when the keyword line is incorrect or contains an unrecognized keyword.
* file(line): values do not match keys for section section; Ignoring line

  The number of fields on a line in a configuration section does not match the number of keywords. This error can be caused by not putting () in a column to represent the default value.
* file: HostModel section missing or invalid

  file: Resource section missing or invalid

  file: HostType section missing or invalid

  The HostModel, Resource, or HostType section in the lsf.shared file is either missing or contains an unrecoverable error.
* file(line): Name name reserved or previously defined. Ignoring index

  The name that is assigned to an external load index must not be the same as any built-in or previously defined resource or load index.
* file(line): Duplicate clustername name in section cluster. Ignoring current line

  A cluster name is defined twice in the same lsf.shared file. The second definition is ignored.
* file(line): Bad cpuFactor for host model model. Ignoring line

  The CPU factor declared for the named host model in the lsf.shared file is not a valid number.
* file(line): Too many host models, ignoring model name

  You can declare a maximum of 127 host models in the lsf.shared file.
* file(line): Resource name name too long in section resource. Should be less than 40 characters. Ignoring line

  The maximum length of a resource name is 39 characters. Choose a shorter name for the resource.
* file(line): Resource name name reserved or previously defined. Ignoring line.

  You attempted to define a resource name that is reserved by LSF or already defined in the lsf.shared file. Choose another name for the resource.
* file(line): illegal character in resource name: name, section resource. Line ignored.

  Resource names must begin with a letter in the set \[a-zA-Z], followed by letters, digits, or underscores \[a-zA-Z0-9\_].
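
The resource-name rules above (at most 39 characters, a leading letter, then letters, digits, or underscores, and no reuse of an existing name) can be checked before editing lsf.shared. This is a minimal sketch; `check_resource_name` and the `existing` argument are hypothetical names, and the set of names reserved by LSF is not known to this code, so the caller must supply it.

```python
import re

MAX_RESOURCE_NAME_LEN = 39  # "Should be less than 40 characters"
_RESOURCE_NAME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")


def check_resource_name(name, existing=()):
    """Return None if `name` passes the rules, else a short reason string."""
    if len(name) > MAX_RESOURCE_NAME_LEN:
        return "too long"
    if not _RESOURCE_NAME_RE.fullmatch(name):
        return "illegal character"
    # `existing` stands in for reserved and already-defined names.
    if name in existing:
        return "reserved or previously defined"
    return None
```

For example, `check_resource_name("scratch_gb")` returns `None`, while a name that starts with a digit is reported as `"illegal character"`.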

## LIM messages

The following messages are logged by the LIM:

* findHostbyAddr/: Host / is unknown by

  function: Gethostbyaddr\_(/) failed: error

  main: Request from unknown host /: error

  function: Received request from non-LSF host /

  The daemon does not recognize host. The request is not serviced. These messages can occur if host was added to the configuration files, but not all the daemons were reconfigured to read the new information. If the problem still occurs after reconfiguring all the daemons, check whether the host is a multi-addressed host.
* rcvLoadVector: Sender (/) may have different config?

  MasterRegister: Sender (host) may have different config?

  LIM detected inconsistent configuration information with the sending LIM. Run the following command so that all the LIMs have the same configuration information.

  ```
  lsadmin reconfig
  ```

  Note any hosts that failed to be contacted.
* rcvLoadVector: Got load from client-only host /. Kill LIM on /

  A LIM is running on a client host. Run the following command, or go to the client host and kill the LIM daemon.

  ```
  lsadmin limshutdown host
  ```
* saveIndx: Unknown index name \<name> from ELIM

  LIM received an external load index name that is not defined in the lsf.shared file. If name is defined in lsf.shared, reconfigure the LIM. Otherwise, add name to the lsf.shared file and reconfigure all the LIMs.
* saveIndx: ELIM over-riding value of index

  This warning message is logged when the ELIM sends a value for one of the built-in index names. LIM uses the value from the ELIM in place of the value that is obtained from the kernel.
* getusr: Protocol error numIndx not read (cc=num): error

  getusr: Protocol error on index number (cc=num): error

  Protocol error between ELIM and LIM.
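
When many `may have different config?` messages accumulate, it helps to list the distinct senders before running **lsadmin reconfig**. The sketch below is only an illustration: `hosts_with_config_mismatch` is a hypothetical helper, it matches the message text quoted above, and it assumes nothing about the rest of the log line (timestamps, prefixes) or about the exact form of the sender inside the parentheses.

```python
import re

# Matches the two LIM messages that report a sender with a different configuration.
_SENDER_RE = re.compile(
    r"(?:rcvLoadVector|MasterRegister): Sender \((.+?)\) may have different config\?"
)


def hosts_with_config_mismatch(log_lines):
    """Return the distinct senders named in the messages, in first-seen order."""
    senders = []
    for line in log_lines:
        match = _SENDER_RE.search(line)
        if match and match.group(1) not in senders:
            senders.append(match.group(1))
    return senders
```

The function takes any iterable of strings, so an open log file can be passed directly.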

## RES messages

The following messages are logged by the RES:

* doacceptconn: getpwnam(\<username>@/) failed: error

  doacceptconn: User  has uid  on client host /, uid  on RES host; assume bad user

  authRequest: username/uid /\<uid>@/ does not exist

  authRequest: Submitter’s name \<clname>@ is different from name  on this host

  RES assumes that a user has the same user ID and user name on all the LSF hosts. These messages occur if this assumption is violated. If the user is allowed to use LSF for interactive remote execution, make sure the user’s account has the same user ID and user name on all LSF hosts.
* doacceptconn: root remote execution permission denied

  authRequest: root job submission rejected

  Root tried to run or submit a job but **LSF\_ROOT\_REX** is not defined in the lsf.conf file.
* resControl: operation permission denied, uid =

  The user with user ID uid is not allowed to make RES control requests. Only the LSF administrator can make RES control requests. If the **LSF\_ROOT\_REX** parameter is defined in the lsf.conf file, root can also make RES control requests.
* resControl: access(respath, X\_OK): error

  The RES received a restart request, but failed to find the file respath to re-execute itself. Make sure respath contains the RES binary, and it has execution permission.
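
Because RES assumes that each user has the same user ID on every LSF host, a quick comparison of account data can locate the accounts behind the `doacceptconn` and `authRequest` messages. This is a minimal sketch under stated assumptions: `find_uid_mismatches` is a hypothetical helper, and the input mapping of host name to `{username: uid}` is assumed to have been collected separately on each host.

```python
def find_uid_mismatches(accounts_by_host):
    """Return {username: {uid: [hosts]}} for users whose uid differs between hosts."""
    seen = {}  # username -> {uid: [hosts that report this uid]}
    for host, accounts in accounts_by_host.items():
        for user, uid in accounts.items():
            seen.setdefault(user, {}).setdefault(uid, []).append(host)
    # Keep only users that were seen with more than one distinct uid.
    return {user: by_uid for user, by_uid in seen.items() if len(by_uid) > 1}
```

The sketch reports conflicting user IDs only; a user who is missing from some hosts is not flagged.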

## mbatchd and sbatchd messages

The following messages are logged by the **mbatchd** and **sbatchd** daemons:

* renewJob: Job : rename(,) failed: error

  **mbatchd** failed to resubmit a rerunnable job. Check that the file from exists and that the LSF administrator can rename the file. If from is in an AFS directory, check that the LSF administrator’s token processing is set up properly.
* logJobInfo\_: fopen(\<logdir/info/jobfile>) failed: error

  logJobInfo\_: write \<logdir/info/jobfile>  failed: error

  logJobInfo\_: seek \<logdir/info/jobfile> failed: error

  logJobInfo\_: write \<logdir/info/jobfile> xdrpos  failed: error

  logJobInfo\_: write \<logdir/info/jobfile> xdr buf len  failed: error

  logJobInfo\_: close(\<logdir/info/jobfile>) failed: error

  rmLogJobInfo: Job : can’t unlink(\<logdir/info/jobfile>): error

  rmLogJobInfo\_: Job : can’t stat(\<logdir/info/jobfile>): error

  readLogJobInfo: Job  can’t open(\<logdir/info/jobfile>): error

  start\_job: Job : readLogJobInfo failed: error

  readLogJobInfo: Job : can’t read(\<logdir/info/jobfile>) size size: error

  initLog: mkdir(\<logdir/info>) failed: error

  : fopen(\<logdir/file> failed: error

  getElogLock: Can’t open existing lock file \<logdir/file>: error

  getElogLock: Error in opening lock file \<logdir/file>: error

  releaseElogLock: unlink(\<logdir/lockfile>) failed: error

  touchElogLock: Failed to open lock file \<logdir/file>: error

  touchElogLock: close \<logdir/file> failed: error

  **mbatchd** failed to create, remove, read, or write the log directory or a file in the log directory, for the reason that is given in error. Check that the LSF administrator has read, write, and execute permissions on the logdir directory.
* replay\_newjob: File  at line : Queue  not found, saving to queue

  replay\_switchjob: File  at line : Destination queue  not found, switching to queue

  When the **mbatchd** daemon was reconfigured, jobs were found in queue but that queue is no longer in the configuration.
* replay\_startjob: JobId : exec host  not found, saving to host

  When the **mbatchd** daemon was reconfigured, the event log contained jobs that are dispatched to host, but that host is no longer configured to be used by LSF.
* do\_restartReq: Failed to get hData of host /

  **mbatchd** received a request from **sbatchd** on host host\_name, but that host is not known to **mbatchd**. Either the configuration file has changed but **mbatchd** was not reconfigured to pick up the new configuration, or host\_name is a client host but the **sbatchd** daemon is running on that host. Run the following command to reconfigure the **mbatchd** daemon or kill the **sbatchd** daemon on host\_name.

  ```
  badmin reconfig
  ```

## LSF command messages

LSF daemon (LIM) not responding ... still trying

During LIM restart, LSF commands might fail and display this error message. User programs that are linked to the LIM API also fail for the same reason. The message is displayed when the LIM on a host in the master host list or server host list is restarted after a configuration change (such as adding new resources) or after a binary upgrade.

Use the **LSF\_LIM\_API\_NTRIES** parameter in the lsf.conf file or as an environment variable to define how many times LSF commands retry to communicate with the LIM API while LIM is not available. The **LSF\_LIM\_API\_NTRIES** parameter is ignored by LSF and EGO daemons and all EGO commands.

When the **LSB\_API\_VERBOSE=Y** parameter is set in the lsf.conf file, LSF batch commands display the not responding retry error message to stderr when LIM is not available.

When the **LSB\_API\_VERBOSE=N** parameter is set in the lsf.conf file, LSF batch commands do not display the retry error message when LIM is not available.
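
The effect of a retry count such as **LSF\_LIM\_API\_NTRIES** can be pictured with a generic bounded-retry loop. This is an illustrative sketch only, not how LSF commands are implemented: the function names, the one-second delay, the default of 1, and the use of `ConnectionError` to stand for "LIM not responding" are all assumptions.

```python
import os
import time


def ntries_from_env(default=1):
    """Read the retry count from the environment.
    The default of 1 is an assumption of this sketch, not an LSF fact."""
    return int(os.environ.get("LSF_LIM_API_NTRIES", default))


def call_with_retries(operation, ntries, delay_seconds=1.0, sleep=time.sleep):
    """Call `operation` up to `ntries` times, pausing between failed attempts."""
    if ntries < 1:
        raise ValueError("ntries must be at least 1")
    last_error = None
    for attempt in range(1, ntries + 1):
        try:
            return operation()
        except ConnectionError as exc:  # stands in for "LIM not responding"
            last_error = exc
            if attempt < ntries:
                sleep(delay_seconds)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

A caller would wrap its own LIM request in `operation`; once the attempts are used up, the last error is raised, which corresponds to the command failing.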

## Batch command client messages

LSF displays error messages when a batch command cannot communicate with the **mbatchd** daemon. The following table provides a list of possible error reasons and the associated error message output.

| Point of failure | Possible reason | Error message output |
| --- | --- | --- |
| Establishing a connection with the **mbatchd** daemon | The **mbatchd** daemon is too busy to accept new connections. The connect() system call times out. | LSF is processing your request. Please wait… |
| Establishing a connection with the **mbatchd** daemon | The **mbatchd** daemon is down or no process is listening at either the **LSB\_MBD\_PORT** or the **LSB\_QUERY\_PORT** | LSF is down. Please wait… |
| Establishing a connection with the **mbatchd** daemon | The **mbatchd** daemon is down and the **LSB\_QUERY\_PORT** is busy | **bhosts** displays LSF is down. Please wait…; **bjobs** displays Cannot connect to LSF. Please wait… |
| Establishing a connection with the **mbatchd** daemon | Socket error on the client side | Cannot connect to LSF. Please wait… |
| Establishing a connection with the **mbatchd** daemon | connect() system call fails | Cannot connect to LSF. Please wait… |
| Establishing a connection with the **mbatchd** daemon | Internal library error | Cannot connect to LSF. Please wait… |
| Send/receive handshake message to/from the **mbatchd** daemon | The **mbatchd** daemon is busy. Client times out when LSF is waiting to receive a message from **mbatchd**. | LSF is processing your request. Please wait… |
| Send/receive handshake message to/from the **mbatchd** daemon | Socket read()/write() fails | Cannot connect to LSF. Please wait… |
| Send/receive handshake message to/from the **mbatchd** daemon | Internal library error | Cannot connect to LSF. Please wait… |
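
For scripts that wrap batch commands, the table can be turned into a small lookup from the displayed message to its possible reasons. This is a minimal sketch: `CLIENT_MESSAGE_REASONS` and `possible_reasons` are hypothetical names, and matching on the leading text of each message (rather than the full string) is a choice made here so that `…` and `...` are treated alike.

```python
# Summary of the table above, keyed by the leading text of each message.
CLIENT_MESSAGE_REASONS = {
    "LSF is processing your request": [
        "mbatchd is too busy to accept new connections; connect() timed out",
        "mbatchd is busy; the client timed out waiting for a handshake message",
    ],
    "LSF is down": [
        "mbatchd is down, or nothing is listening on LSB_MBD_PORT or LSB_QUERY_PORT",
        "mbatchd is down and LSB_QUERY_PORT is busy (as displayed by bhosts)",
    ],
    "Cannot connect to LSF": [
        "mbatchd is down and LSB_QUERY_PORT is busy (as displayed by bjobs)",
        "socket error on the client side",
        "connect() system call failed",
        "socket read()/write() failed during the handshake",
        "internal library error",
    ],
}


def possible_reasons(command_output):
    """Return the possible reasons for the first known message found in the output."""
    for prefix, reasons in CLIENT_MESSAGE_REASONS.items():
        if prefix in command_output:
            return reasons
    return []
```

An empty list means that the output contains none of the three client messages.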

## EGO command messages

You cannot run the egosh command because the administrator has chosen not to enable EGO in lsf.conf: LSF\_ENABLE\_EGO=N.

If EGO is not enabled, the **egosh** command cannot find the ego.conf file or cannot contact the **vemkd** daemon (likely because it is not started).

