SMD

Section: Slurm components (1)
Updated: February 2014

 

NAME

smd - Used to manage failures in a resource allocation.

 

SYNOPSIS

smd    [OPTIONS...] [job_id]

 

DESCRIPTION

Slurm command used to manage failures in a resource allocation.

 

OPTIONS

-c, --show-config
Shows the configuration of smd.
-d, --drain-node node_name
Drains the specified host of the job (note: a reason must be given with -R).
-D, --drop_node node_name
Drops the failed or failing host.
-e, --extend-time
Extends the run time of the job by the specified number of minutes.
-f, --faulty-nodes node_name
Lists the failed or failing hosts of the job.
-j, --job_info
Displays information about the specified job ID.
-r, --replace-node node_name
Replaces the failed or drained host with a new one.
-v, --verbose
Prints detailed event logging.
Multiple -v's will further increase the verbosity of logging.
By default, only errors are displayed.
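
For example, information about a job together with verbose event logging can be requested in a single invocation, following the synopsis above (an illustrative command; its output is not shown here and may vary):

       $ smd -j -v $SLURM_JOBID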

 

EXAMPLES

Show the smd configuration.
        > smd -c
        System Configuration:
        ConfigurationFile: /etc/nonstop.conf
        ControllerAddress: localhost
        LibraryDebug: 0
        ControllerPort: 9114
        ReadTimeout: 10000
        WriteTimeout: 10000
        HotSpareCount: "debug:0"
        MaxSpareNodeCount: 10
        TimeLimitDelay: 600
        TimeLimitDrop: 0
        TimeLimitExtend: 2
        UserDrainAllow: "alan,brenda"
        UserDrainDeny: "none"
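
The values above are read from the file named by ConfigurationFile. As a rough sketch only, assuming the usual Slurm key=value format and that the keys shown map directly onto nonstop.conf parameters (see nonstop.conf(5) for the authoritative names and syntax), the corresponding /etc/nonstop.conf might contain entries such as:

        ControllerAddress=localhost
        ControllerPort=9114
        HotSpareCount=debug:0
        TimeLimitExtend=2
        UserDrainAllow=alan,brenda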

Replace a failed node in a job allocation and extend its time limit.

       $ salloc -N4 --no-kill bash
       salloc: Granted job allocation 67
       $ squeue
       JOBID PARTITION  NAME  USER  ST  TIME  NODES NODELIST(REASON)
          67     debug  bash jette   R  0:48      4 tux[0-3]
       salloc: error: Node failure on tux2
       $ smd -f $SLURM_JOBID
       Job 67 has 1 failed or failing hosts:
         node tux2 cpu_count 1 state FAILED
       $ smd -r tux2 $SLURM_JOBID
       Job 67 got node tux2 replaced with node tux4
       $ squeue
       JOBID PARTITION  NAME  USER  ST  TIME  NODES NODELIST(REASON)
          67     debug  bash jette   R  0:48      4 tux[0-1,3-4]
       $ smd -e 2 $SLURM_JOBID
       Job 67 run time increased by 2min successfully
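
The new time limit can be confirmed with squeue's long-format report, which includes a TIME_LIMIT column (command shown here without its output):

       $ squeue -l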

Identify a failing node in a job allocation, drop it from the job allocation, and extend the job time limit.

       $ salloc -N4 --no-kill bash
       salloc: Granted job allocation 69
       $ squeue
       JOBID PARTITION  NAME  USER  ST  TIME  NODES NODELIST(REASON)
          69     debug  bash jette   R  0:48      4 tux[0-3]
       $ smd -d tux2 -R "Application X hangs" $SLURM_JOBID
       Job 69 node tux2 is being drained
       $ smd -f $SLURM_JOBID
       Job 69 has 1 failed or failing hosts:
         node tux2 cpu_count 1 state FAILING
       $ smd -D tux2 $SLURM_JOBID
       Job 69 node tux2 dropped successfully
       $ squeue
       JOBID PARTITION  NAME  USER  ST  TIME  NODES NODELIST(REASON)
          69     debug  bash jette   R  0:48      4 tux[0-1,3]
       $ smd -e 2 $SLURM_JOBID
       Job 69 run time increased by 2min successfully

 

COPYING

Copyright (C) 2013-2014 SchedMD LLC. All rights reserved.

Slurm is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

 

SEE ALSO

nonstop.conf(5)


 
