SMD

Section: Slurm components (1)
Updated: February 2014

NAME

smd - Used to manage failures in a resource allocation.

SYNOPSIS

smd [OPTIONS...] [job_id]

DESCRIPTION

Slurm command used to manage failures in a resource allocation.

OPTIONS

-c, --show-config: Shows the configuration of smd.
-d, --drain-node node_name: Drains the hosts of the job (Note: Must include reason -R).
-D, --drop_node node_name: Drops the failed or failing host.
-e, --extend-time: Extends the run time of the job. In the examples below, the argument is the number of minutes to add.
-f, --faulty-nodes node_name: Gets the hosts of the job that have failed or are failing.
-j, --job_info: Gets the information of the specified job id.
-r, --replace-node node_name: Replaces the drained host with a new one.
-v, --verbose: Prints detailed event logging.
Multiple -v's further increase the verbosity of logging.
By default, only errors are displayed.

EXAMPLES

Show the smd configuration.

        > smd -c
        System Configuration:
        ConfigurationFile: /etc/nonstop.conf
        ControllerAddress: localhost
        LibraryDebug: 0
        ControllerPort: 9114
        ReadTimeout: 10000
        WriteTimeout: 10000
        HotSpareCount: "debug:0"
        MaxSpareNodeCount: 10
        TimeLimitDelay: 600
        TimeLimitDrop: 0
        TimeLimitExtend: 2
        UserDrainAllow: "alan,brenda"
        UserDrainDeny: "none"
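
A script may need a single value from this listing. The following is a minimal sketch, assuming the output keeps the "Key: value" layout shown above. The name smd_config_value is a hypothetical helper and is not part of smd.

```shell
# Hypothetical helper (not part of smd): print the value of one key
# from "smd -c" output read on standard input.
# Assumes each setting is printed as "Key: value", as in the listing above.
smd_config_value() {
    key="$1"
    awk -v key="$key" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)             # drop leading indentation
            prefix = key ":"
            if (index(line, prefix) == 1) {      # line starts with "Key:"
                value = substr(line, length(prefix) + 1)
                sub(/^[ \t]+/, "", value)        # trim space after the colon
                gsub(/^"|"$/, "", value)         # strip surrounding quotes
                print value
                exit
            }
        }'
}

# Usage (requires a working smd installation):
#   smd -c | smd_config_value ControllerPort
```

Matching on the full "Key:" prefix keeps a short key from matching a longer one that merely starts with the same letters.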

Replace a failed node in a job allocation and extend its time limit.

       $ salloc -N4 --no-kill bash
       salloc: Granted job allocation 67
       $ squeue
       JOBID PARTITION  NAME  USER  ST  TIME  NODES NODELIST(REASON)
          67     debug  bash jette   R  0:48      4 tux[0-3]
       salloc: error: Node failure on tux2
       $ smd -f $SLURM_JOBID
       Job 67 has 1 failed or failing hosts:
         node tux2 cpu_count 1 state FAILED
       $ smd -r tux2 $SLURM_JOBID
       Job 67 got node tux2 replaced with node tux4
       $ squeue
       JOBID PARTITION  NAME  USER  ST  TIME  NODES NODELIST(REASON)
          67     debug  bash jette   R  0:48      4 tux[0-1,3-4]
       $ smd -e 2 $SLURM_JOBID
       Job 67 run time increased by 2min successfully
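
The manual steps above could be scripted. The following is a minimal sketch, assuming "smd -f" prints one "node NAME cpu_count N state STATE" line per host, as in the session above. The function names nodes_in_state and replace_failed_nodes are hypothetical, and only the -f, -r, and -e options shown in this page are used.

```shell
# Hypothetical helper: read "smd -f" output on standard input and print
# the names of the nodes reported in the requested state.
# Assumes lines shaped like "node tux2 cpu_count 1 state FAILED".
nodes_in_state() {
    awk -v want="$1" '$1 == "node" && $5 == "state" && $6 == want { print $2 }'
}

# Hypothetical helper: replace every FAILED node of a job, then extend
# the job's time limit.
# Arguments: job id, number of minutes to add.
replace_failed_nodes() {
    job_id="$1"
    minutes="$2"
    for node in $(smd -f "$job_id" | nodes_in_state FAILED); do
        # Stop at the first replacement that smd reports as unsuccessful.
        smd -r "$node" "$job_id" || return 1
    done
    smd -e "$minutes" "$job_id"
}

# Usage inside an allocation (requires a working smd installation):
#   replace_failed_nodes "$SLURM_JOBID" 2
```

The sketch extends the time limit even when no node needed replacing. Whether that is desirable depends on the site, so treat it as a starting point rather than a recommended policy.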

Identify a failing node in a job allocation, drop it from the job allocation, and extend the job time limit.

       $ salloc -N4 --no-kill bash
       salloc: Granted job allocation 69
       $ squeue
       JOBID PARTITION  NAME  USER  ST  TIME  NODES NODELIST(REASON)
          69     debug  bash jette   R  0:48      4 tux[0-3]
       $ smd -d tux2 -R "Application X hangs" $SLURM_JOBID
       Job 69 node tux2 is being drained
       $ smd -f
       Job 69 has 1 failed or failing hosts:
         node tux2 cpu_count 1 state FAILING
       $ smd -D tux2 $SLURM_JOBID
       Job 69 node tux2 dropped successfully
       $ squeue
       JOBID PARTITION  NAME  USER  ST  TIME  NODES NODELIST(REASON)
          69     debug  bash jette   R  0:48      3 tux[0-1,3]
       $ smd -e 2 $SLURM_JOBID
       Job 69 run time increased by 2min successfully
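
The drain and drop steps of this session could be wrapped in one function. The following is a minimal sketch, assuming "smd -f" reports the drained node as FAILING immediately after "smd -d" returns. That timing is an assumption, not something this page states. The name drain_and_drop is hypothetical. Because -d must be accompanied by a reason (-R), the sketch refuses to run without one.

```shell
# Hypothetical helper: drain a node with a reason, confirm that smd now
# lists it as FAILING, then drop it from the job allocation.
# Arguments: job id, node name, reason text.
drain_and_drop() {
    job_id="$1"
    node="$2"
    reason="$3"

    # -d requires a reason, so fail early when none is given.
    if [ -z "$reason" ]; then
        echo "drain_and_drop: a reason is required (-R)" >&2
        return 2
    fi

    smd -d "$node" -R "$reason" "$job_id" || return 1

    # Drop the node only if "smd -f" lists it in state FAILING.
    # Assumes lines shaped like "node tux2 cpu_count 1 state FAILING".
    if smd -f "$job_id" | awk -v n="$node" \
        '$1 == "node" && $2 == n && $5 == "state" && $6 == "FAILING" { found = 1 }
         END { exit !found }'
    then
        smd -D "$node" "$job_id"
    else
        echo "drain_and_drop: $node is not reported as FAILING" >&2
        return 1
    fi
}

# Usage inside an allocation (requires a working smd installation):
#   drain_and_drop "$SLURM_JOBID" tux2 "Application X hangs"
```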

COPYING

Slurm is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

This document was created by man2html using the manual pages.
Time: 22:44:54 GMT, November 14, 2016
