LSF Handbook
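  • Preface
  • Migration notice
  • Part I Getting Started
    • Chapter 1 Introduction to LSF
      • 1.1 LSF in brief
      • 1.2 LSF system requirements and compatibility
        • Operating system support
        • Host selection
        • Server host compatibility
        • Add-on compatibility
        • API compatibility
      • 1.3 Limitations
      • 1.4 Release notes
      • 1.5 LSF quick start
  • Chapter 2 Installing, Upgrading, and Migrating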
    • [2.1 Install on UNIX and Linux](chapter2/section1/Install on UNIX and Linux.md)
      • [Example installation directory structure](chapter2/section1/Example installation directory structure.md)
      • [Planning your installation](chapter2/section1/Planning your installation.md)
        • [EGO in the LSF cluster](chapter2/section1/EGO in the LSF cluster.md)
        • [Master host selection](chapter2/section1/Master host selection.md)
      • [Preparing your systems for installation](chapter2/section1/Preparing your systems for installation.md)
      • [Installing a new LSF cluster (lsfinstall)](chapter2/section1/Installing a new LSF cluster.md)
      • [Getting fixes from IBM Fix Central](chapter2/section1/Getting fixes from IBM Fix Central.md)
      • [Configuring a cluster](chapter2/section1/Configuring a cluster.md)
      • [Installing LSF as a non-root user](chapter2/section1/If you install LSF as a non-root user.md)
      • [Adding hosts to the cluster](chapter2/section1/Adding hosts to the cluster.md)
      • [LSF HPC features](chapter2/section1/LSF HPC features.md)
        • [Optional LSF HPC features configuration](chapter2/section1/Optional LSF HPC features configuration.md)
      • [Registering service ports](chapter2/section1/Registering service ports.md)
      • install.config file
      • slave.config file
    • [2.2 Install on Windows](chapter2/section2/Install on Windows.md)
      • [Example installation directory structures](chapter2/section2/Example installation directory structures.md)
      • [EGO in the LSF cluster](chapter2/section2/EGO in the LSF cluster.md)
      • [Planning and preparing your systems for installation](chapter2/section2/Planning and preparing your systems for installation.md)
        • [Master host selection](chapter2/section2/Master host selection.md)
        • [Entitlement files](chapter2/section2/Entitlement files.md)
      • [Installing a new LSF cluster](chapter2/section2/Installing a new LSF cluster.md)
      • [Installation parameter quick reference](chapter2/section2/Installation parameter quick reference.md)
    • [2.3 Install LSF with IBM Spectrum Cluster Foundation](chapter2/section3/Install LSF with IBM Spectrum Clus
      • [Install LSF Suites 10.1.1 with IBM Spectrum Cluster Foundation](chapter2/section3/Install LSF Suites 10.1.1
        • Installation
          • [Installation planning](chapter2/section3/subsection1/Installation planning.md)
            • [Preinstallation roadmap](chapter2/section3/subsection1/Preinstallation roadmap.md)
            • [Installation roadmap](chapter2/section3/subsection1/Installation roadmap.md)
          • [Preparing to install IBM Spectrum LSF Suite for Workgroups or IBM Spectrum LSF Suite for HPC](chapter2/section3/subsection1/P
            • Requirements
            • [Configure and test switches](chapter2/section3/subsection1/Configure and test switches.md)
            • [Plan your network configuration](chapter2/section3/subsection1/Plan your network configuration.md)
            • [Installing and verifying the operating system on the management node](chapter2/section3/subsection1/Installing and verifying the operating system on the
          • [Performing an installation](chapter2/section3/subsection1/Performing an installation.md)
            • [Comparing installation methods](chapter2/section3/subsection1/Comparing installation methods.md)
            • [Interactive installation roadmaps](chapter2/section3/subsection1/Interactive installation roadmaps.md)
              • [Quick installation roadmap](chapter2/section3/subsection1/Quick installation roadmap.md)
              • [Custom installation](chapter2/section3/subsection1/Custom installation.md)
            • [Performing an interactive installation using the installer](chapter2/section3/subsection1/Performing an interactive installation using the insta
          • [Performing a silent installation](chapter2/section3/subsection1/Performing a silent installation.md)
          • [Verifying the installation](chapter2/section3/subsection1/Verifying the installation.md)
          • [Taking the first steps after installation](chapter2/section3/subsection1/Taking the first steps after installation.md)
            • [Adding compute nodes after installation](chapter2/section3/subsection1/Adding compute nodes after installation.md)
              • [Import compute nodes](chapter2/section3/subsection1/Import compute nodes.md)
              • [Discover compute nodes](chapter2/section3/subsection1/Discover compute nodes.md)
          • [Troubleshooting installation problems](chapter2/section3/subsection1/Troubleshooting installation problems.md)
        • [Cluster deployment](chapter2/section3/subsection2/Cluster deployment.md)
          • [Creating an LSF Workgroups cluster after installation](chapter2/section3/subsection2/Creating an LSF Workgroups cluster after installatio
          • [Creating an LSF HPC cluster after installation](chapter2/section3/subsection2/Creating an LSF HPC cluster after installation.md)
        • [Setting up a high availability environment](chapter2/section3/subsection3/Setting up a high availability environment.md)
          • [High availability requirements](chapter2/section3/subsection3/High availability requirements.md)
            • [Prepare a shared file system](chapter2/section3/subsection3/Prepare a shared file system.md)
          • [Preparing high availability](chapter2/section3/subsection3/Preparing high availability.md)
          • [Enable a high availability environment](chapter2/section3/subsection3/Enable a high availability environment.md)
          • [Verifying a high availability environment](chapter2/section3/subsection3/Verifying a high availability environment.md)
          • [Troubleshooting a high availability environment enablement](chapter2/section3/subsection3/Troubleshooting a high availability environment enab
        • [After installation](chapter2/section3/subsection4/After installation.md)
          • [Managing a cluster](chapter2/section3/subsection4/Managing a cluster.md)
            • [Add or remove servers](chapter2/section3/subsection4/Add or remove servers.md)
            • [Synchronizing a cluster](chapter2/section3/subsection4/Synchronizing a cluster.md)
            • [Deleting a cluster](chapter2/section3/subsection4/Deleting a cluster.md)
            • [Maintenance mode](chapter2/section3/subsection4/Maintenance mode.md)
            • [Health check](chapter2/section3/subsection4/Health check.md)
          • [Upgrading a cluster](chapter2/section3/subsection4/Upgrading a cluster.md)
            • [Performing a cluster upgrade](chapter2/section3/subsection4/Performing a cluster upgrade.md)
            • [Rolling or bulk upgrade process overview](chapter2/section3/subsection4/Rolling or bulk upgrade process overview.md)
        • [Known issues and limitations](chapter2/section3/subsection5/Known issues and limitations.md)
    • [2.4 Upgrade and migrate](chapter2/section4/Upgrade and migrate.md)
      • [Upgrade on UNIX and Linux](chapter2/section4/subsection1/Upgrade on UNIX and Linux.md)
        • [Upgrade LSF on UNIX and Linux](chapter2/section4/subsection1/Upgrade LSF on UNIX and Linux.md)
        • [Getting fixes from IBM Fix Central](chapter2/section4/subsection1/Getting fixes from IBM Fix Central.md)
      • [Migrate on Windows](chapter2/section4/subsection2/Migrate on Windows.md)
        • [Migrate LSF on Windows](chapter2/section4/subsection2/Migrate LSF on Windows.md)
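  • Part II Basic Operations
    • Chapter 3 User Basics
      • 3.1 LSF overview
        • Introduction to LSF
        • Cluster components
      • 3.2 A closer look at LSF
        • LSF services and processes
        • Cluster communication methods
        • Fault tolerance
        • Security
      • 3.3 Workload management
        • Job lifecycle
        • Job submission
        • Job scheduling
        • Host selection
        • Job execution environment
      • 3.4 LSF with EGO enabled
        • EGO component overview
        • Resources
        • LSF resource sharing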
  • Chapter 4 Administrator Basics
    • 4.1 Cluster overview
      • Terms and concepts
      • Cluster characteristics
      • File systems, directories, and files
        • [Example directory structures](chapter4/section1/Example directory structures.md)
      • Important directories and configuration files
    • 4.2 Using LSF
      • Starting, stopping, and reconfiguring LSF
        • [Setting up the LSF environment](chapter4/section2/subsection1/Setting up the LSF environment.md)
        • [Starting your cluster](chapter4/section2/subsection1/Starting your cluster.md)
        • [Stopping your cluster](chapter4/section2/subsection1/Stopping your cluster.md)
        • [Reconfiguring your cluster](chapter4/section2/subsection1/Reconfiguring your cluster.md)
      • Checking LSF status
        • [Check cluster configuration](chapter4/section2/subsection2/Check cluster configuration.md)
        • [Check cluster status](chapter4/section2/subsection2/Check cluster status.md)
        • [Check LSF batch system configuration](chapter4/section2/subsection2/Check LSF batch system configuration.md)
        • [Find out batch system status](chapter4/section2/subsection2/Find out batch system status.md)
      • Running jobs
        • [Submit batch jobs](chapter4/section2/subsection3/Submit batch jobs.md)
        • [Display job status](chapter4/section2/subsection3/Display job status.md)
        • [Control job execution](chapter4/section2/subsection3/Control job execution.md)
        • [Run interactive tasks](chapter4/section2/subsection3/Run interactive tasks.md)
        • [Integrate your applications with LSF](chapter4/section2/subsection3/Integrate your applications with LSF.md)
      • Managing users, hosts, and queues
        • [Making your cluster available to users](chapter4/section2/subsection4/Making your cluster available to users.md)
        • [Adding a host to your cluster](chapter4/section2/subsection4/Adding a host to your cluster.md)
        • [Removing a host from your cluster](chapter4/section2/subsection4/Removing a host from your cluster.md)
        • [Adding a queue](chapter4/section2/subsection4/Adding a queue.md)
        • [Removing a queue](chapter4/section2/subsection4/Removing a queue.md)
      • Configuring LSF startup
        • [Allowing LSF administrators to start LSF daemons](chapter4/section2/subsection5/Allowing LSF administrators to start LSF daemo
        • [Setting up automatic LSF startup](chapter4/section2/subsection5/Setting up automatic LSF startup.md)
      • Managing software licenses and other shared resources
    • 4.3 Troubleshooting LSF
      • Common LSF problems
      • LSF error messages
  • Part III Job Scheduling
    • Chapter 5 Job Scheduling Management
      • 5.1 About IBM Spectrum LSF
        • LSF clusters, jobs, and queues
        • Hosts
        • LSF daemons
        • Batch jobs and tasks
        • Host types and host models
        • Users and administrators
        • Resources
        • Job lifecycle
      • 5.2 Running jobs
        • Submitting jobs with bsub
          • [About submitting a job to a specific queue](chapter5/section2/subsection1/About submitting a job to a specific queue.md)
          • [View available queues](chapter5/section2/subsection1/View available queues.md)
          • [Submit a job to a queue](chapter5/section2/subsection1/Submit a job to a queue.md)
          • [Submit a job associated with a project (bsub -P)](chapter5/section2/subsection1/Submit a job associated with a project.md)
          • [Submit a job associated with a user group (bsub -G)](chapter5/section2/subsection1/Submit a job associated with a user group.md)
          • [Submit a job with a job name (bsub -J)](chapter5/section2/subsection1/Submit a job with a job name.md)
          • [Submit a job to a service class (bsub -sla)](chapter5/section2/subsection1/Submit a job to a service class.md)
          • [Submit a job under a job group (bsub -g)](chapter5/section2/subsection1/Submit a job under a job group.md)
          • [Submit a job with a JSON file (bsub -json)](chapter5/section2/subsection1/Submit a job with a JSON file.md)
          • [Submit a job with a YAML file (bsub -yaml)](chapter5/section2/subsection1/Submit a job with a YAML file.md)
          • [Submit a job with a JSDL file (bsub -jsdl)](chapter5/section2/subsection1/Submit a job with a JSDL file.md)
        • [Modify pending jobs (bmod)](chapter5/section2/subsection2/Modify pending jobs.md)
        • [Modify running jobs](chapter5/section2/subsection3/Modify running jobs.md)
        • [About controlling jobs](chapter5/section2/subsection4/About controlling jobs.md)
          • [Kill a job (bkill)](chapter5/section2/subsection4/Kill a job.md)
          • [About suspending and resuming jobs (bstop and bresume)](chapter5/section2/subsection4/About suspending and resuming jobs.md)
          • [Move a job to the bottom of a queue (bbot)](chapter5/section2/subsection4/Move a job to the bottom of a queue.md)
          • [Move a job to the top of a queue (btop)](chapter5/section2/subsection4/Move a job to the top of a queue.md)
          • [Control jobs in job groups](chapter5/section2/subsection4/Control jobs in job groups.md)
          • [Submit a job to specific hosts](chapter5/section2/subsection4/Submit a job to specific hosts.md)
          • [Submit a job with specific resources](chapter5/section2/subsection4/Submit a job with specific resources.md)
          • [Queues and host preference](chapter5/section2/subsection4/Queues and host preference.md)
          • [Specify different levels of host preference](chapter5/section2/subsection4/Specify different levels of host preference.md)
          • [Submit a job with resource requirements](chapter5/section2/subsection4/Submit a job with resource requirements.md)
          • [Submit a job with SSH X11 forwarding](chapter5/section2/subsection4/Submit a job with SSH X11 forwarding.md)
        • [Using LSF with non-shared file space](chapter5/section2/subsection5/Using LSF with non-shared file space.md)
          • Operators
        • [About resource reservation](chapter5/section2/subsection6/About resource reservation.md)
          • [View resource information](chapter5/section2/subsection6/View resource information.md)
          • [Submit a job with resource requirements](chapter5/section2/subsection6/Submit a job with resource requirements.md)
          • [Submit a job with start or termination times](chapter5/section2/subsection6/Submit a job with start or termination times.md)
          • [Submit a job with compute unit resource requirements](chapter5/section2/subsection6/Submit a job with compute unit resource requirements
        • [Set pending time limits](chapter5/section2/subsection7/Set pending time limits.md)
      • 5.3 Monitoring jobs
        • [View information about jobs](chapter5/section3/subsection1/View information about jobs.md)
          • [View unfinished jobs](chapter5/section3/subsection1/View unfinished jobs.md)
          • [View summary information of unfinished jobs](chapter5/section3/subsection1/View summary information of unfinished jobs.md)
          • [View all jobs](chapter5/section3/subsection1/View all jobs.md)
          • [View running jobs](chapter5/section3/subsection1/View running jobs.md)
          • [View pending reasons for jobs](chapter5/section3/subsection1/View pending reasons for jobs.md)
          • [View job suspending reasons](chapter5/section3/subsection1/View job suspending reasons.md)
          • [View detailed job information](chapter5/section3/subsection1/View detailed job information.md)
          • [View job group information](chapter5/section3/subsection1/View job group information.md)
          • [Monitor SLA progress](chapter5/section3/subsection1/Monitor SLA progress.md)
          • [View job output](chapter5/section3/subsection1/View job output.md)
          • [View chronological history of jobs](chapter5/section3/subsection1/View chronological history of jobs.md)
          • [View history of jobs not listed in active event log](chapter5/section3/subsection1/View history of jobs not listed in active event l
          • [View job history](chapter5/section3/subsection1/View job history.md)
          • [View the job submission environment](chapter5/section3/subsection1/View the job submission environment.md)
          • [Update interval](chapter5/section3/subsection1/Update interval.md)
          • [Job-level information](chapter5/section3/subsection1/Job-level information.md)
        • [Display resource allocation limits](chapter5/section3/subsection2/Display resource allocation limits.md)
          • [View information about resource allocation limits](chapter5/section3/subsection2/View information about resource allocation limits.md)
  • Part IV Cluster Operations and Maintenance
    • Chapter 6 LSF Cluster Administration
      • [6.1 Cluster management essentials](chapter6/section1/Cluster management essentials.md)
        • [Work with your cluster](chapter6/section1/subsection1/Work with your cluster.md)
          • Viewing cluster information
          • Control daemons
          • Commands to reconfigure your cluster
          • Live reconfiguration
          • Adding cluster administrators
        • [Working with hosts](chapter6/section1/subsection2/Working with hosts.md)
          • Host status
          • View host information
          • Control hosts
          • Connect to an execution host or container
          • Host names
        • [Job directories and data](chapter6/section1/subsection3/Job directories and data.md)
          • Directory for job output
          • Specify a directory for job output
          • Temporary job directories
          • About flexible job CWD
          • About flexible job output directory
        • [Job notification](chapter6/section1/subsection4/Job notification.md)
          • Disable job email
          • Size of job email
      • [6.2 Monitoring cluster operations and health](chapter6/section2/Monitoring cluster operations and health.md)
        • [Monitor cluster performance](chapter6/section2/subsection1/Monitor cluster performance.md)
        • [Monitor job information](chapter6/section2/subsection2/Monitor job information.md)
        • [Monitor applications by using external scripts](chapter6/section2/subsection3/Monitor applications by using external scripts.md)
        • [View resource information](chapter6/section2/subsection4/View resource information.md)
        • [View user and user group information](chapter6/section2/subsection5/View user and user group information.md)
        • [View queue information](chapter6/section2/subsection6/View queue information.md)
      • [6.3 Managing job execution](chapter6/section3/Managing job execution.md)
        • [Managing job execution](chapter6/section3/subsection1/Managing job execution.md)
        • [Job file spooling](chapter6/section3/subsection2/Job file spooling.md)
        • [Job data management](chapter6/section3/subsection3/Job data management.md)
        • [Job scheduling and dispatch](chapter6/section3/subsection4/Job scheduling and dispatch.md)
        • [Control job execution](chapter6/section3/subsection5/Control job execution.md)
        • [Interactive jobs and remote tasks](chapter6/section3/subsection6/Interactive jobs and remote tasks.md)
      • [6.4 Configuring and sharing job resources](chapter6/section4/Configuring and sharing job resources.md)
        • [About LSF resources](chapter6/section4/subsection1/About LSF resources.md)
        • [Representing job resources in LSF](chapter6/section4/subsection2/Representing job resources in LSF.md)
        • [Plan-based scheduling and reservations](chapter6/section4/subsection3/Plan-based scheduling and reservations.md)
        • [Distributing job resources to users in LSF](chapter6/section4/subsection4/Distributing job resources to users in LSF.md)
      • [6.5 GPU resources](chapter6/section5/GPU resources.md)
        • [Enabling GPU features](chapter6/section5/subsection1/Enabling GPU features.md)
        • [Monitoring GPU resources](chapter6/section5/subsection2/Monitoring GPU resources.md)
        • [Submitting and monitoring GPU jobs](chapter6/section5/subsection3/Submitting and monitoring GPU jobs.md)
        • [GPU features using ELIM](chapter6/section5/subsection4/GPU features using ELIM.md)
      • [6.6 Configuring containers](chapter6/section6/Configuring containers.md)
        • [LSF with Docker](chapter6/section6/subsection1/LSF with Docker.md)
        • [LSF with Shifter](chapter6/section6/subsection2/LSF with Shifter.md)
        • [LSF with Singularity](chapter6/section6/subsection3/LSF with Singularity.md)
      • [6.7 High throughput workload administration](chapter6/section7/High throughput workload administration.md)
        • [Job packs](chapter6/section7/subsection1/Job packs.md)
        • [Job arrays](chapter6/section7/subsection2/Job arrays.md)
        • [Fairshare scheduling](chapter6/section7/subsection3/Fairshare scheduling.md)
        • [Guaranteed resource pools](chapter6/section7/subsection4/Guaranteed resource pools.md)
        • [Reserving memory and license resources](chapter6/section7/subsection5/Reserving memory and license resources.md)
      • [6.8 Parallel workload administration](chapter6/section8/Parallel workload administration.md)
        • [Running parallel jobs](chapter6/section8/subsection1/Running parallel jobs.md)
        • [Advance reservation](chapter6/section8/subsection2/Advance reservation.md)
        • [Fairshare scheduling](chapter6/section8/subsection3/Fairshare scheduling.md)
        • [Job checkpoint and restart](chapter6/section8/subsection4/Job checkpoint and restart.md)
        • [Job migration for checkpointable and rerunnable jobs](chapter6/section8/subsection5/Job migration for checkpointable and rerunnable job
        • [Resizable jobs](chapter6/section8/subsection6/Resizable jobs.md)
      • [6.9 Security in LSF](chapter6/section9/Security in LSF.md)
        • [Security considerations](chapter6/section9/subsection1/Security considerations.md)
        • [Secure your LSF cluster](chapter6/section9/subsection2/Secure your LSF cluster.md)
      • [6.10 Advanced configuration](chapter6/section10/Advanced configuration.md)
        • [Error and event logging](chapter6/section10/subsection1/Error and event logging.md)
        • [Event generation](chapter6/section10/subsection2/Event generation.md)
        • [Customize batch command messages](chapter6/section10/subsection3/Customize batch command messages.md)
        • [How LIM determines host models and types](chapter6/section10/subsection4/How LIM determines host models and types.md)
        • [Shared file access](chapter6/section10/subsection5/Shared file access.md)
        • [Shared configuration file content](chapter6/section10/subsection6/Shared configuration file content.md)
        • [Authentication and authorization](chapter6/section10/subsection7/Authentication and authorization.md)
        • [Handle job exceptions](chapter6/section10/subsection8/Handle job exceptions.md)
        • [Tune CPU factors](chapter6/section10/subsection9/Tune CPU factors.md)
        • [Set clean period for DONE jobs](chapter6/section10/subsection10/Set clean period for DONE jobs.md)
        • [Enable host-based resources](chapter6/section10/subsection11/Enable host-based resources.md)
        • [Global fairshare scheduling](chapter6/section10/subsection12/Global fairshare scheduling.md)
        • [Manage LSF on EGO](chapter6/section10/subsection13/Manage LSF on EGO.md)
        • [Load sharing X applications](chapter6/section10/subsection14/Load sharing X applications.md)
        • [Using LSF with the Etnus TotalView debugger](chapter6/section10/subsection15/Using LSF with the Etnus TotalView
        • [Register LSF host names and IP addresses to LSF servers](chapter6/section10/subsection16/Register LSF host names and IP address
      • [6.11 Performance tuning](chapter6/sectio11/Performance tuning.md)
        • [Tune your cluster](chapter6/section11/subsection1/Tune your cluster.md)
        • [Achieve performance and scalability](chapter6/section11/subsection2/Achieve performance and scalability.md)
      • [6.12 Energy aware scheduling](chapter6/section12/Energy aware scheduling.md)
        • [Managing host power states](chapter6/section12/subsection1/Managing host power states.md)
        • [CPU frequency management](chapter6/section12/subsection2/CPU frequency management.md)
        • [Automatic CPU frequency selection](chapter6/section12/subsection3/Automatic CPU frequency selection.md)
      • [6.13 LSF multicluster capability](chapter6/section13/LSF multicluster capability.md)
        • [Overview of LSF multicluster capability](chapter6/section13/subsection1/Overview of LSF multicluster capability.md)
        • [Set up LSF multicluster capability](chapter6/section13/subsection2/Set up LSF multicluster capability.md)
        • [Job forwarding model](chapter6/section13/subsection3/Job forwarding model.md)
        • [Resource leasing model](chapter6/section13/subsection4/Resource leasing model.md)
      • [6.14 LSF Advanced Edition](chapter6/section14/LSF Advanced Edition.md)
        • [Overview of LSF Advanced Edition](chapter6/section14/subsection1/Overview of LSF Advanced Edition.md)
        • [Set up LSF Advanced Edition](chapter6/section14/subsection2/Set up LSF Advanced Edition.md)
        • [Configure LSF Advanced Edition features](chapter6/section14/subsection3/Configure LSF Advanced Edition features.
        • [Using LSF Advanced Edition](chapter6/section14/subsection4/Using LSF Advanced Edition.md)
        • [Reference for LSF Advanced Edition](chapter6/section14/subsection5/Reference for LSF Advanced Edition.md)
  • Chapter 7 Reference
    • 7.1 Command reference
      • bacct
      • badmin
      • bapp
      • battach
      • battr
      • bbot
      • bchkpnt
      • bclusters
      • bconf
      • bdata
        • Synopsis
        • Subcommands
          • cache
          • chgrp
          • chmod
          • tags
          • showconf
          • connections
          • admin
        • [Help and version display](chapter7/section1/subsection1/Help and version display.md)
        • [See also](chapter7/section1/subsection1/See also.md)
      • bentags
      • bgadd
      • bgdel
      • bgmod
      • bgpinfo
      • bhist
      • bhosts
      • bhpart
      • bimages
      • bjdepinfo
      • bjgroup
      • bjobs
        • Categories
          • Category: filter
          • Category: format
          • Category: state
        • Options
          • -A
          • -a
          • -aff
          • -app
          • -aps
          • -cname
          • -d
          • -data
          • -env
          • -fwd
          • -G
          • -g
          • -gpu
          • -hms
          • -hostfile
          • -Jd
          • -json
          • -Lp
          • -l
          • -m
          • -N
          • -noheader
          • -o
          • -P
          • -p
          • -pe
          • -pei
          • -pi
          • -plan
          • -prio
          • -psum
          • -q
          • -r
          • -rusage
          • -s
          • -script
          • -sla
          • -ss
          • -sum
          • -U
          • -UF
          • -u
          • -W
          • -WF
          • -WL
          • -WP
          • -w
          • -X
          • -x
          • job_id
          • -h
          • -V
        • Description
      • bkill
      • bladmin
      • blaunch
      • blcollect
      • blcstat
      • blhosts
      • blimits
      • blinfo
      • blkill
      • blparams
      • blstat
      • bltasks
      • blusers
      • bmgroup
      • bmig
      • bmod
      • bparams
      • bpeek
      • bpost
      • bqueues
      • bread
      • brequeue
      • bresize
      • bresources
      • brestart
      • bresume
      • brlainfo
      • brsvadd
      • brsvdel
      • brsvjob
      • brsvmod
      • brsvs
      • brsvsub
      • brun
      • bsla
      • bslots
      • bstage
        • bstage in
        • bstage out
        • [Help and version display](chapter7/section1/subsection3/Help and version display.md)
        • [See also](chapter7/section1/subsection3/See also.md)
      • bstatus
      • bstop
      • bsub
        • Categories
          • Category: io
          • Category: limit
          • Category: notify
          • Category: pack
          • Category: properties
          • Category: resource
          • Category: schedule
          • Category: script
        • Options
          • -a
          • -alloc_flags
          • -app
          • -ar
          • -B
          • -b
          • -C
          • -c
          • -clusters
          • -cn_cu
          • -cn_mem
          • -core_isolation
          • -csm
          • -cwd
          • -D
          • -data
          • -datachk
          • -datagrp
          • -E
          • -Ep
          • -e
          • -env
          • -eo
          • -eptl
          • -ext
          • -F
          • -f
          • -freq
          • -G
          • -g
          • -gpu
          • -H
          • -hl
          • -hostfile
          • -I
          • -Ip
          • -IS
          • -ISp
          • -ISs
          • -Is
          • -IX
          • -i
          • -is
          • -J
          • -Jd
          • -jobaff
          • -jsdl
          • -jsdl_strict
          • -jsm
          • -json
          • -K
          • -k
          • -L
          • -Lp
          • -ln_mem
          • -ln_slots
          • -M
          • -m
          • -mig
          • -N
          • -Ne
          • -n
          • -notify
          • -network
          • -nnodes
          • -o
          • -oo
          • -outdir
          • -P
          • -p
          • -pack
          • -ptl
          • -Q
          • -q
          • -R
          • -r
          • -rn
          • -rnc
          • -S
          • -s
          • -sla
          • -smt
          • -sp
          • -stage
          • -step_cgroup
          • -T
          • -t
          • -ti
          • -tty
          • -U
          • -u
          • -ul
          • -v
          • -W
          • -We
          • -w
          • -wa
          • -wt
          • -XF
          • -x
          • -yaml
          • -Zs
          • command
          • job_script
          • LSB_DOCKER_PLACE_HOLDER
          • -h
          • -V
        • Description
      • bswitch
      • btop
      • bugroup
      • busers
      • bwait
      • ch
      • gpolicyd
      • lim
      • lsacct
      • lsacctmrg
      • lsadmin
      • lsclusters
      • lseligible
      • lsfinstall
      • lsfmon
      • lsfrestart
      • lsfshutdown
      • lsfstartup
      • lsgrun
      • lshosts
      • lsid
      • lsinfo
      • lsload
      • lsloadadj
      • lslogin
      • lsltasks
      • lsmake
      • lsmon
      • lspasswd
      • lsplace
      • lsportcheck
      • lsrcp
      • lsreghost (UNIX)
      • lsreghost (Windows)
      • lsrtasks
      • lsrun
      • lstcsh
      • pam
      • patchinstall
      • pversions (UNIX)
      • pversions (Windows)
      • ssacct
      • ssched
      • taskman
      • tspeek
      • tssub
      • wgpasswd
      • wguser
    • 7.2 Configuration reference
      • Configuration files
        • cshrc.lsf and profile.lsf
        • hosts
        • install.config
        • lim.acct
        • lsb.acct
        • lsb.applications
        • lsb.events
        • lsb.globalpolicies
        • lsb.hosts
        • lsb.modules
        • lsb.params
        • lsb.queues
        • lsb.reasons
        • lsb.resources
        • lsb.serviceclasses
        • lsb.threshold
        • lsb.users
        • lsf.acct
        • lsf.cluster
        • lsf.conf
        • lsf.datamanager
        • lsf.licensescheduler
        • lsf.shared
        • lsf.sudoers
        • lsf.task
        • setup.config
        • slave.config
      • Environment variables
        • [Environment variables set for job execution](chapter7/section2/subsection2/Environment variables set for job execution.md)
        • [Environment variables for resize notification command](chapter7/section2/subsection2/Environment variables for resize notification command.
        • [Environment variables for session scheduler](chapter7/section2/subsection2/Environment variables for session scheduler.md)
        • [Environment variables for data provenance](chapter7/section2/subsection2/Environment variables for data provenance.md)
        • [Environment variable reference](chapter7/section2/subsection2/Environment variable reference.md)
    • 7.3 API reference
  • Part V Extensions
    • Chapter 8 LSF Extensions
      • LSF Session Scheduler
      • LSF with Rational ClearCase
      • LSF on Cray
      • LSF with Apache Spark
      • LSF with Apache Hadoop
      • LSF with Cluster Systems Manager
      • LSF with IBM Cloud Private
      • LSF Job Step Manager
      • Submitting jobs using JSDL
      • LSF Simulator
  • Chapter 9 Best Practices and Tips
    • [Accounting file management](https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/best_practice
    • Allocating CPUs as blocks for parallel jobs
    • [Cleaning up parallel job execution problems](https://www.ibm.com/support/knowledgecenter/SSWRJV_10.
    • Configuring IBM Aspera as a data transfer tool
    • [Customizing job query output format](https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/best
    • [Defining external host-based resources](https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/b
    • [Enforcing job memory and swap with Linux cgroups](https://www.ibm.com/support/knowledgecenter/SSWRJ
    • [Job access control](https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/best_practices/Job ac
    • Integration with AFS
    • [Maintaining cluster performance](https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/best_pra
    • Managing floating software licenses
    • [Optimizing LSF job processing with CPU frequency governors enabled](https://www.ibm.com/support/kno
    • OS partitioning and virtualization on Oracle Solaris and IBM AIX
    • [Placing jobs based on available job slots of hosts](https://www.ibm.com/support/knowledgecenter/SSW
    • [Running checksum to verify installation images](https://www.ibm.com/support/knowledgecenter/SSWRJV_
    • [Tracking job dependencies](https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/best_practices
    • [Understanding mbatchd performance metrics](https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.
    • [Using compute units for topology scheduling](https://www.ibm.com/support/knowledgecenter/SSWRJV_10.
    • [Using job directories](https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/best_practices/Usi
    • [Using lsmake to accelerate Android builds](https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.
    • [Using NVIDIA DGX systems with LSF](https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/best_p
    • Using ssh X11 forwarding
    • [Using the Python wrapper for LSF API](https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/bes
  • [Chapter 10 LSF License Scheduler](chapter10/LSF License Scheduler.md)
    • Introduction
      • Overview
      • [Differences between LSF License Scheduler editions](chapter10/section1/Differences between LSF License Scheduler edition
      • Glossary
      • Architecture
    • [Installing and starting License Scheduler](chapter10/section2/Installing and starting License Scheduler.md)
      • [Install License Scheduler](chapter10/section2/subsection1/Install License Scheduler.md)
        • [Before you install](chapter10/section2/subsection1/Before you install.md)
        • [What the License Scheduler setup script does](chapter10/section2/subsection1/What the License Scheduler setup script d
        • [Install License Scheduler with LSF (UNIX)](chapter10/section2/subsection1/Install License Scheduler with L
        • [Install License Scheduler on Windows](chapter10/section2/subsection1/Install License Scheduler on Window
          • [Install License Scheduler with LSF (Windows)](chapter10/section2/subsection1/Install License Scheduler wit
        • Troubleshooting
        • [Configure LSF License Scheduler Basic Edition](chapter10/section2/subsection1/Configure LSF License Scheduler Basic
      • [Start License Scheduler](chapter10/section2/Start License Scheduler.md)
      • [LSF parameters in License Scheduler](chapter10/section2/LSF parameters in License Scheduler.md)
      • [About submitting jobs](chapter10/section2/About submitting jobs.md)
      • [After configuration changes](chapter10/section2/After configuration changes.md)
      • [Add a cluster to License Scheduler](chapter10/section2/Add a cluster to License Scheduler.md)
      • [Configure multiple administrators](chapter10/section2/Configure multiple administrators.md)
      • [Upgrade License Scheduler](chapter10/section2/Upgrade License Scheduler.md)
      • Firewalls
    • [LSF License Scheduler concepts](chapter10/section3/LSF License Scheduler concepts.md)
      • [License Scheduler modes](chapter10/section3/License Scheduler modes.md)
      • [Project groups](chapter10/section3/Project groups.md)
      • [Service domains in License Scheduler](chapter10/section3/Service domains in License Scheduler.md)
      • [Distribution policies](chapter10/section3/Distribution policies.md)
      • [Project mode preemption](chapter10/section3/subsection1/Project mode preemption.md)
        • [Preemption restrictions](chapter10/section3/subsection1/Preemption restrictions.md)
        • [LSF preemption with License Scheduler preemption](chapter10/section3/subsection1/LSF preemption with License Scheduler
      • [License usage with FlexNet and Reprise License Manager](chapter10/section3/subsection2/License usage with FlexN
        • [Known license requirements](chapter10/section3/subsection2/Known license requirements.md)
        • [Unknown license requirements](chapter10/section3/subsection2/Unknown license requirements.md)
        • [Project mode](chapter10/section3/subsection2/Project mode.md)
        • [Cluster mode](chapter10/section3/subsection2/Cluster mode.md)
        • [Reserved FlexNet Manager licenses](chapter10/section3/subsection2/Reserved FlexNet Manager licenses.md)
    • [配置许可证调度程序](chapter10/section4/Configuring License Scheduler.md)
      • [配置集群模式](chapter10/section4/Configure cluster mode.md)
      • [保证配置集群模式](chapter10/section4/Configure cluster mode with guarantees.md)
      • [项目模式与项目](chapter10/section4/Project mode with projects.md)
      • [项目模式可选设置](chapter10/section4/Project mode optional settings.md)
      • [项目组的项目模式](chapter10/section4/Project mode with project groups.md)
      • [配置快速调度项目模式](chapter10/section4/subsection1/Configure fast dispatch project mode.md)
        • [配置 lmremove 或 rlmremove 抢占](chapter10/section4/subsection1/Configure lmremove or rlmremove preempti
      • [基于时间的自动配置](chapter10/section4/Automatic time-based configuration.md)
      • 故障转移
        • [局域网的故障转移配置](chapter10/section4/subsection2/Failover provisioning for LANs.md)
        • [外网的故障转移配置](chapter10/section4/subsection2/Failover provisioning for WANs.md)
          • [在 WAN 中配置并启动 License Scheduler](chapter10/section4/subsection2/Configure and start License Schedule
          • [WAN 示例](chapter10/section4/subsection2/WAN example.md)
          • [主机和网络级别的服务供应](chapter10/section4/subsection2/Service provisioning at the host and network levels.md
        • [设置 fod](chapter10/section4/subsection2/Set up fod.md)
      • [用户认证](chapter10/section4/User authentication.md)
    • [查看信息和故障排除](chapter10/section5/Viewing information and troubleshooting.md)
      • [关于查看可用许可证](chapter10/section5/subsection1/About viewing available licenses.md)
        • [查看传递给作业的许可证服务器和许可证功能信息](chapter10/section5/subsection1/View license server and license feature info
        • [自定义动态许可证信息输出](chapter10/section5/subsection1/Customize dynamic license information output.md)
      • [关于错误日志](chapter10/section5/subsection2/About error logs.md)
        • [管理日志文件](chapter10/section5/subsection2/Manage log files.md)
        • [临时更改日志级别](chapter10/section5/subsection2/Temporarily change the log level.md)
      • 故障排除
        • [文件位置](chapter10/section5/subsection3/File locations.md)
        • [检查 blstat 是否支持 lmstat](chapter10/section5/subsection3/Check that lmstat is supported by blcollect.m
        • [除非您定义了 LSF License Scheduler elim,否则不要删除lsb.tokens](chapter10/section5/subsection3/Do not delete ls
    • 参考
      • lsf.licensescheduler
      • bladmin
      • blcollect
      • blcstat
      • blhosts
      • blinfo
      • blkill
      • blparams
      • blstat
      • bltasks
      • blusers
      • fod.conf
      • fodadmin
      • fodapps
      • fodhosts
      • fodid
      • taskman
  • Part VI 经验总结篇
    • Chapter 11
    • Chapter 12
  • 后记
  • 附录
  • 参考资料
由 GitBook 提供支持
在本页
  • 查找 LSF 错误日志
  • 步骤
  • 诊断和修复大多数 LSF 问题
  • 步骤
  • 无法打开 lsf.conf 文件
  • 任务说明
  • 步骤
  • LIM 无响应地挂掉
  • 步骤
  • LIM 通信超时
  • 任务说明
  • 步骤
  • 主 LIM 挂掉
  • 任务说明
  • 步骤
  • 用户权限被拒绝
  • 步骤
  • 由于文件名空间不一致,远程执行失败
  • 任务说明
  • 步骤
  • 批处理守护程序无响应挂掉
  • 任务说明
  • 步骤
  • sbatchd 启动,但是 mbatchd 没有启动
  • 步骤
  • 避免孤立的作业流程
  • 任务说明
  • 步骤
  • LSF 未使用主机
  • 任务说明
  • 步骤
  • 未知的主机类型或型号
  • 步骤
  • 默认主机类型或型号
  • 步骤

这有帮助吗?

  1. Chapter 4 管理员操作基础
  2. 4.3 LSF 排错

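Common LSF problems

Most problems are caused by incorrect installation or configuration. Before you start troubleshooting an LSF problem, always check the error log files first. Log messages often point directly to the problem.

Finding LSF error logs

When something goes wrong, LSF server daemons log error messages in the LSF log directory (specified by the LSF_LOGDIR parameter in the lsf.conf file).

Procedure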

Make sure that the primary LSF administrator owns LSF_LOGDIR, and that root can write to this directory.

If an LSF server is unable to write to LSF_LOGDIR, then the error logs are created in /tmp. LSF logs errors to the following files:

  • lim.log.host_name

  • res.log.host_name

  • pim.log.host_name

  • mbatchd.log.master_host

  • mbschd.log.master_host

  • sbatchd.log.host_name

  • vemkd.log.master_host

If these log files contain any error messages that you do not understand, contact IBM Support.
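As a rough illustration of where to look, the following Python sketch builds the log file paths named in the list above for a single host. The function `expected_log_files` and its parameters are hypothetical helpers, not part of LSF. The `logdir_writable` flag only models the documented fallback to /tmp; the caller must decide whether LSF_LOGDIR was actually writable.

```python
import posixpath

# Daemons that log on every host, and daemons that log only on the master host
# (both taken from the file list above).
PER_HOST_DAEMONS = ("lim", "res", "pim", "sbatchd")
MASTER_DAEMONS = ("mbatchd", "mbschd", "vemkd")


def expected_log_files(logdir, host_name, is_master=False, logdir_writable=True):
    """Return the log file paths to inspect on one host.

    Hypothetical helper: logdir is the value of LSF_LOGDIR, and
    logdir_writable=False models the documented fallback to /tmp.
    """
    base = logdir if logdir_writable else "/tmp"
    daemons = PER_HOST_DAEMONS + (MASTER_DAEMONS if is_master else ())
    return [posixpath.join(base, f"{daemon}.log.{host_name}") for daemon in daemons]
```

On a master host, pass `is_master=True` so that the mbatchd, mbschd, and vemkd logs are included.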

Diagnosing and fixing most LSF problems

Use these general troubleshooting steps for most LSF problems.

Procedure

  1. Run the lsadmin ckconfig -v command and note any errors that are shown in the command output.

    Look for the error in one of the problems described in this section. If none of these troubleshooting steps applies to your situation, contact IBM Support.

  2. Use the following commands to restart the LSF cluster:

    # lsadmin limrestart all
    # lsadmin resrestart all
    # badmin hrestart all
  3. Run the ps -ef command to see whether the LSF daemons are running.

    Look for the processes similar to the following command output:

    root 17426     1  0   13:30:40 ?    0:00 /opt/lsf/cluster1/10.1/sparc-sol10/etc/lim
    root 17436     1  0   13:31:11 ?    0:00 /opt/lsf/cluster1/10.1/sparc-sol10/etc/sbatchd
    root 17429     1  0   13:30:56 ?    0:00 /opt/lsf/cluster1/10.1/sparc-sol10/etc/res
  4. Check the LSF error logs on the first few hosts that are listed in the Host section of the LSF_CONFDIR/lsf.cluster.cluster_name file.

    If the LSF_MASTER_LIST parameter is defined in the LSF_CONFDIR/lsf.conf file, check the error logs on the hosts that are listed in this parameter instead.
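The daemon check in step 3 can be sketched in Python as follows. This is a minimal sketch, assuming the `ps -ef` output has already been captured as text. The helper name `missing_daemons` is hypothetical, and it looks only for the three daemons shown in the sample output (lim, res, sbatchd).

```python
import posixpath

# The daemons that appear in the sample ps -ef output above.
EXPECTED_DAEMONS = ("lim", "res", "sbatchd")


def missing_daemons(ps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemon names that do not appear in ps -ef output.

    A field counts as a daemon only if it looks like a path whose last
    component is one of the expected names.
    """
    running = set()
    for line in ps_output.splitlines():
        for field in line.split():
            if "/" in field and posixpath.basename(field) in expected:
                running.add(posixpath.basename(field))
    return [name for name in expected if name not in running]
```

An empty result means that all three daemons were found. Any names that are returned are candidates for a closer look at the corresponding error log.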

Cannot open lsf.conf file

You might see this message when you run the lsid command. The message usually means that the LSF_CONFDIR/lsf.conf file is not accessible to LSF.

About this task

By default, LSF checks the directory that is defined by the LSF_ENVDIR parameter for the lsf.conf file. If the lsf.conf file is not in LSF_ENVDIR, LSF looks for it in the /etc directory.

Procedure

  • Make sure that a symbolic link exists from /etc/lsf.conf to lsf.conf.

  • Use the cshrc.lsf or profile.lsf script to set up your LSF environment.

  • Make sure that the cshrc.lsf or profile.lsf script is available for users to set the LSF environment variables.
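The lookup order described above (LSF_ENVDIR first, then /etc) can be mirrored in a small Python check. This is a sketch under stated assumptions: `locate_lsf_conf` is a hypothetical helper, and it only tests whether the file exists, not whether LSF can read it.

```python
import os


def locate_lsf_conf(environ=None, exists=os.path.exists):
    """Return the first lsf.conf path found, or None.

    Mirrors the documented lookup order: the directory named by
    LSF_ENVDIR first, then /etc. The exists parameter is injectable
    so that the logic can be exercised without touching the file system.
    """
    environ = os.environ if environ is None else environ
    candidates = []
    envdir = environ.get("LSF_ENVDIR")
    if envdir:
        candidates.append(envdir.rstrip("/") + "/lsf.conf")
    candidates.append("/etc/lsf.conf")
    for path in candidates:
        if exists(path):
            return path
    return None
```

A result of `None` suggests that neither location holds the file, which matches the situation in which the error message appears.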

LIM dies quietly

When the LSF LIM daemon exits unexpectedly, check for errors in the LIM configuration files.

Procedure

Run the following command:

lsadmin ckconfig -v

This command displays most configuration errors. If the command does not report any errors, check the LIM error log.

LIM communication times out

Sometimes the LIM is up, but running the lsload command prints the following error message: Communication time out.

About this task

If the LIM just started, LIM needs time to get initialized by reading configuration files and contacting other LIMs. If the LIM does not become available within one or two minutes, check the LIM error log for the host you are working on.

To prevent communication timeouts when the local LIM is starting or restarting, define the parameter LSF_SERVER_HOSTS in the lsf.conf file. The client contacts the LIM on one of the LSF_SERVER_HOSTS and runs the command. At least one of the hosts that are defined in the list must have a LIM that is up and running.

When the local LIM is running but the cluster has no master, LSF applications display the following message:

Cannot locate master LIM now, try later.

Procedure

Check the LIM error logs on the first few hosts that are listed in the Host section of the lsf.cluster.cluster_name file. If the LSF_MASTER_LIST parameter is defined in the lsf.conf file, check the LIM error logs on the hosts that are listed in this parameter instead.
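The host selection rule in this procedure can be written down as a short Python sketch. The helper `master_candidates` is hypothetical. It assumes that lsf.conf uses `NAME="value"` lines and that the caller has already extracted the ordered host names from the Host section of the lsf.cluster.cluster_name file. The cutoff of three hosts for "the first few" is an arbitrary choice.

```python
import re


def master_candidates(lsf_conf_text, cluster_hosts, first_n=3):
    """Return the hosts whose LIM error logs should be checked first.

    lsf_conf_text: contents of lsf.conf.
    cluster_hosts: host names, in order, from the Host section of
                   lsf.cluster.cluster_name (parsing that file is left out).
    first_n:       how many hosts count as "the first few" (assumption).
    """
    for line in lsf_conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        match = re.match(r'LSF_MASTER_LIST\s*=\s*"?([^"]*)"?$', line)
        if match and match.group(1).strip():
            return match.group(1).split()
    return list(cluster_hosts[:first_n])
```

When LSF_MASTER_LIST is defined, only the hosts in that list are returned; otherwise the function falls back to the first hosts of the cluster file.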

Master LIM is down

Sometimes the master LIM is up, but running the lsload or lshosts command displays the following error message: Master LIM is down; try later.

About this task

If the /etc/hosts file on the host where the master LIM is running maps the host name to the loopback IP address (127.0.0.1), LSF client LIMs cannot contact the master LIM. When the master LIM starts up, it sets its official host name and IP address to the loopback address. Any client that requests the master LIM address receives 127.0.0.1, tries to connect to it, and in fact tries to connect to itself.

Procedure

Check the IP configuration of your master LIM in /etc/hosts.

The following example incorrectly sets the master LIM IP address to the loopback address:

127.0.0.1   localhost   myhostname

The following example correctly sets the master LIM IP address:

127.0.0.1    localhost
192.168.123.123   myhostname

For a master LIM running on a host that uses an IPv6 address, the loopback address is

::1

The following example correctly sets the master LIM IP address by using an IPv6 address:

::1         localhost ipv6-localhost ipv6-loopback
fe00::0     ipv6-localnet
ff00::0     ipv6-mcastprefix
ff02::1     ipv6-allnodes
ff02::2     ipv6-allrouters
ff02::3     ipv6-allhosts
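The /etc/hosts check above can be automated with a few lines of Python. This is a minimal sketch, assuming the file contents are passed in as text. The helper names `loopback_names` and `master_is_on_loopback` are hypothetical, and the host name `myhostname` is the placeholder from the examples above.

```python
import ipaddress


def loopback_names(hosts_text):
    """Return every name that /etc/hosts-style text maps to a loopback address."""
    names = []
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # ignore comments
        if len(fields) < 2:
            continue
        try:
            address = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # first field is not an IP address; skip the line
        if address.is_loopback:  # covers 127.0.0.1 and ::1
            names.extend(fields[1:])
    return names


def master_is_on_loopback(hosts_text, master_host_name):
    """True if the master host name resolves to a loopback address in hosts_text."""
    return master_host_name in loopback_names(hosts_text)
```

A `True` result for the master host name corresponds to the incorrect configuration shown earlier in this section.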

User permission denied

If the remote host cannot securely determine the user ID of the user that is requesting remote execution, remote execution fails with the following error message: User permission denied.

Procedure

  1. Check the RES error log on the remote host for a more detailed error message.

  2. If you do not want to configure an identification daemon (LSF_AUTH in lsf.conf), all applications that do remote executions must be owned by root with the setuid bit set. Run the following command:

    chmod 4755 filename
  3. If the application binary files are on an NFS-mounted file system, make sure that the file system is not mounted with the nosuid flag.

  4. If you are using an identification daemon (the LSF_AUTH parameter in the lsf.conf file), the inetd daemon must be configured. The identification daemon must not be run directly.

  5. Inconsistent host names in a name server with /etc/hosts and /etc/hosts.equiv can also cause this problem. If the LSF_USE_HOSTEQUIV parameter is defined in the lsf.conf file, check that the /etc/hosts.equiv file or the HOME/.rhosts file on the destination host has the client host name in it.

  6. For Windows hosts, users must register and update their Windows passwords by using the lspasswd command. Passwords must be between 3 and 31 characters long.

    For Windows password authentication in a non-shared file system environment, you must define the parameter LSF_MASTER_LIST in the lsf.conf file so that jobs run with correct permissions. If you do not define this parameter, LSF assumes that the cluster uses a shared file system environment.
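For step 2, the effect of `chmod 4755` can be verified from Python by inspecting the file mode. This is a sketch only: `is_setuid_root` and `check_binary` are hypothetical helpers, and they check just the two conditions named in that step (owned by root, setuid bit set).

```python
import os
import stat


def is_setuid_root(mode, owner_uid):
    """True if a file with this st_mode and st_uid is setuid and owned by root."""
    return owner_uid == 0 and bool(mode & stat.S_ISUID)


def check_binary(path):
    """Apply is_setuid_root to a real file (hypothetical convenience wrapper)."""
    info = os.stat(path)
    return is_setuid_root(info.st_mode, info.st_uid)
```

Note that a file on a file system mounted with the nosuid flag can still pass this check, because the mount option is not reflected in the file mode. Step 3 must be verified separately.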

Remote execution fails because of a non-uniform file name space

A non-uniform file name space might cause a command to fail with the following error message: chdir(...) failed: no such file or directory.

About this task

You are trying to run a command remotely, but either your current working directory does not exist on the remote host, or your current working directory is mapped to a different name on the remote host.

If your current working directory does not exist on a remote host, do not run commands remotely on that host.

Procedure

  • If the directory exists, but is mapped to a different name on the remote host, you must create symbolic links to make them consistent.

  • LSF can resolve most, but not all, problems by using automount. The automount maps must be managed through NIS.

    Contact IBM Support if you are running automount and LSF is not able to locate directories on remote hosts.

Batch daemons die quietly

When the LSF batch daemons sbatchd and mbatchd exit unexpectedly, check for errors in the configuration files.

About this task

If the mbatchd daemon is running but the sbatchd daemon dies on some hosts, it might be because mbatchd is not configured to use those hosts.

Procedure

  • Check the sbatchd and mbatchd daemon error logs.

  • Run the badmin ckconfig command to check the configuration.

  • Check for email in the LSF administrator mailbox.

sbatchd starts but mbatchd does not

When the sbatchd daemon starts but the mbatchd daemon is not running, it is possible that mbatchd is temporarily unavailable because the master LIM is temporarily unknown. The following error message is displayed: sbatchd: unknown service.

Procedure

  1. Run the lsid command to check whether LIM is running.

    If LIM is not running properly, follow the steps in the following topics to fix LIM problems: LIM dies quietly, LIM communication times out, and Master LIM is down.

  2. Check whether services are registered properly.

Avoiding orphaned job processes

LSF uses process groups to track all the processes of a job. However, if the application forks a child that becomes a new process group and the parent then dies immediately, the child process group is orphaned from the parent process and cannot be tracked.

Procedure

  1. When a job is started, the application runs under the job RES or root process group.

  2. If an application creates a new process group, and its parent process ID (PPID) still belongs to the job, PIM can track this new process group as part of the job.

    The only reliable way to not lose track of a process is to prevent it from using a new process group. Any process that daemonizes itself is lost when child processes are orphaned from the parent process group because it changes its process group right after it is detached.
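The process group behavior described above can be observed outside LSF with a short Python experiment. This is only an illustration on UNIX or Linux, not a reproduction of how PIM tracks jobs. The helper `child_process_group` is hypothetical, and it uses the standard `start_new_session` option of `subprocess.Popen`, which makes the child call setsid() and therefore leave the parent's process group, much like a process that daemonizes itself.

```python
import os
import subprocess
import sys


def child_process_group(new_session):
    """Start a short-lived child and return its process group ID (UNIX only)."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        start_new_session=new_session,  # True makes the child call setsid()
    )
    try:
        return os.getpgid(child.pid)
    finally:
        # Clean up the child so that the experiment leaves nothing behind.
        child.kill()
        child.wait()
```

A child that is started with `new_session=False` reports the same process group as `os.getpgrp()` in the parent, while a child started with `new_session=True` reports a different one. A tracker that relies only on the original process group would lose sight of the second child.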

Host not used by LSF

The mbatchd daemon allows the sbatchd daemon to run only on the hosts that are listed in the Host section of the lsb.hosts file. If you configure an unknown host in the HostGroup or HostPartition sections of the lsb.hosts file, or as a HOSTS definition for a queue in the lsb.queues file, mbatchd logs an error message.

About this task

If you try to configure a host that is not listed in the Host section of the lsb.hosts file, the mbatchd daemon logs the following message.

mbatchd on host: LSB_CONFDIR/cluster1/configdir/file(line #): Host hostname is not used by lsbatch; ignored

If you start the sbatchd daemon on a host that is not known by the mbatchd daemon, mbatchd rejects the sbatchd. The sbatchd daemon logs the following message and exits.

This host is not used by lsbatch system.

Procedure

  • Add the unknown host to the list of hosts in the Host section of the lsb.hosts file.

  • Start the LSF daemons on the new host.

  • Run the following commands to reconfigure the cluster:

    lsadmin reconfig

    badmin reconfig
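Before you reconfigure, it can help to list which referenced hosts are missing from the Host section. The following Python sketch does that for lsb.hosts text. The helper names are hypothetical, and the sketch assumes the usual `Begin Host` ... `End Host` layout with a column header line (starting with HOST_NAME) as the first line of the section. Special entries such as a default host line are not handled.

```python
def hosts_in_host_section(lsb_hosts_text):
    """Return the host names listed between 'Begin Host' and 'End Host'."""
    hosts, inside, header_seen = [], False, False
    for raw in lsb_hosts_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # ignore comments
        if not fields:
            continue
        if fields[:2] == ["Begin", "Host"]:
            inside, header_seen = True, False
        elif fields[:2] == ["End", "Host"]:
            inside = False
        elif inside and not header_seen:
            header_seen = True  # first line of the section is the column header
        elif inside:
            hosts.append(fields[0])
    return hosts


def unknown_hosts(lsb_hosts_text, referenced_hosts):
    """Return the referenced hosts that are not in the Host section."""
    known = set(hosts_in_host_section(lsb_hosts_text))
    return [host for host in referenced_hosts if host not in known]
```

Pass in the host names used in the HostGroup and HostPartition sections or in queue HOSTS definitions. Any names that are returned are the ones that mbatchd would report as not used by lsbatch.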

UNKNOWN host type or model

A model or type UNKNOWN indicates that the host is down or the LIM on the host is down. You need to take immediate action to restart LIM on the UNKNOWN host.

Procedure

  1. Start the host.

  2. Run the lshosts command to see which host has the UNKNOWN host type or model.

    lshosts
    HOST_NAME  type       model   cpuf   ncpus  maxmem   maxswp  server   RESOURCES 
    hostA   UNKNOWN      Ultra2   20.2       2    256M    710M      Yes   ()
  3. Run the lsadmin limstartup command to start LIM on the host.

    lsadmin limstartup hostA
    Starting up LIM on <hostA> .... done

    If EGO is enabled in the LSF cluster, you can run the following command instead:

    egosh ego start lim hostA
    Starting up LIM on <hostA> .... done

    You can specify more than one host name to start LIM on multiple hosts. If you do not specify a host name, LIM is started on the host from which the command is submitted.

    To start LIM remotely on UNIX or Linux, you must be root or listed in the lsf.sudoers file (or the ego.sudoers file if EGO is enabled in the LSF cluster). You must be able to run the rsh command across all hosts without entering a password.

  4. Wait a few seconds, then run the lshosts command again.

    The lshosts command displays a specific model or type for the host or DEFAULT. If you see DEFAULT, it means that automatic detection of host type or model failed, and the host type that is configured in the lsf.shared file cannot be found. LSF works on the host, but a DEFAULT model might be inefficient because of incorrect CPU factors. A DEFAULT type might also cause binary incompatibility because a job from a DEFAULT host type can be migrated to another DEFAULT host type.
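Scanning lshosts output for UNKNOWN or DEFAULT entries can be scripted. The following Python sketch assumes the column layout shown in the sample output above (host name, type, model in the first three columns). The helper `hosts_with_marker` is hypothetical.

```python
def hosts_with_marker(lshosts_output, marker):
    """Return host names whose type or model column equals marker.

    marker is typically "UNKNOWN" or "DEFAULT".
    """
    flagged = []
    for line in lshosts_output.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "HOST_NAME":
            continue  # skip blank lines, the command echo, and the header
        if marker in (fields[1], fields[2]):
            flagged.append(fields[0])
    return flagged
```

Calling it once with "UNKNOWN" and once with "DEFAULT" gives the two host lists that the procedures in this section and the next one start from.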

DEFAULT host type or model

If you see DEFAULT in the output of the lim -t command, it means that automatic detection of host type or model failed, and the host type that is configured in the lsf.shared file cannot be found. LSF works on the host, but a DEFAULT model might be inefficient because of incorrect CPU factors. A DEFAULT type might also cause binary incompatibility because a job from a DEFAULT host type can be migrated to another DEFAULT host type.

Procedure

  1. Run the lshosts command to see which host has the DEFAULT host model or type.

    lshosts
    HOST_NAME     type    model     cpuf   ncpus  maxmem  maxswp   server  RESOURCES 
    hostA     DEFAULT  DEFAULT        1       2    256M   710M       Yes  ()

    If Model or Type are displayed as DEFAULT when you use the lshosts command and automatic host model and type detection is enabled, you can leave it as is or change it.

    If the host model is DEFAULT, LSF works correctly but the host has a CPU factor of 1, which might not make efficient use of the host model.

    If the host type is DEFAULT, there might be binary incompatibility. For example, if one host is Linux and another is AIX, but both hosts are set to type DEFAULT, jobs that are running on the Linux host might be migrated to the AIX host and vice versa, which might cause the job to fail.

  2. Run lim -t on the host whose type is DEFAULT:

    lim -t
    Host Type             : NTX64
    Host Architecture     : EM64T_1596
    Total NUMA Nodes      : 1
    Total Processors      : 2
    Total Cores           : 4
    Total Threads         : 2
    Matched Type          : NTX64
    Matched Architecture  : EM64T_3000
    Matched Model         : Intel_EM64T
    CPU Factor            : 60.0

    Note the value of Host Type and Host Architecture.

  3. Edit the lsf.shared file to configure the host type and host model for the host.

    1. In the HostType section, enter a new host type. Use the host type name that is detected with the lim -t command.

      Begin HostType
      TYPENAME 
      DEFAULT 
      CRAYJ
      NTX64
      ...
      End HostType
    2. In the HostModel section, enter the new host model with architecture and CPU factor. Use the architecture that is detected with the lim -t command. Add the host model to the end of the host model list. The limit for host model entries is 127. Lines commented out with # are not counted in the 127-line limit.

      Begin HostModel
      MODELNAME   CPUFACTOR     ARCHITECTURE # keyword
      Intel_EM64T      20             EM64T_1596
      End HostModel
  4. Save changes to the lsf.shared file.

  5. Run the lsadmin reconfig command to reconfigure LIM.

  6. Wait a few seconds, and run the lim -t command again to check the type and model of the host.
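Steps 2 and 3 can be tied together with a small Python sketch that reads the lim -t output and formats the matching HostModel row. The helper names `parse_lim_t` and `host_model_line` are hypothetical. The model name and CPU factor are values that you choose, not values that the sketch can detect.

```python
def parse_lim_t(output):
    """Turn the 'Key : Value' lines printed by lim -t into a dictionary."""
    info = {}
    for line in output.splitlines():
        key, separator, value = line.partition(":")
        if separator:  # ignore lines without a colon, such as the command echo
            info[key.strip()] = value.strip()
    return info


def host_model_line(info, model_name, cpu_factor):
    """Format one HostModel row: MODELNAME CPUFACTOR ARCHITECTURE."""
    return f"{model_name}  {cpu_factor}  {info['Host Architecture']}"
```

The detected `Host Type` value from the same dictionary is the name to add to the HostType section.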

For more information about setting up the LSF environment, see Setting up the LSF environment with cshrc.lsf and profile.lsf.

For more information about process tracking with Linux cgroups, see Memory and swap limit enforcement based on Linux cgroup memory subsystem.