Good afternoon!
The control node runs Debian 8, where only Slurm 14.03.9 can be installed. I decided to upgrade the compute nodes to Debian 9, and that pulled in slurmd 16.05.9. As a result: the node can be pinged from the control node, slurmctld is running on the control node and slurmd is running on the node, yet the node still shows up as down in sinfo.
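Just to be precise about the versions, this is how I checked them (assuming the stock Debian slurm-llnl packages; the commands simply print the version):
slurmctld -V    # on the control node, reports 14.03.9
slurmd -V       # on the compute node, reports 16.05.9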
scontrol show slurmd
on the node prints:
Active Steps = NONE
Actual CPUs = 32
Actual Boards = 1
Actual sockets = 4
Actual cores = 8
Actual threads per core = 1
Actual real memory = 257950 MB
Actual temp disk space = 1024 MB
Boot time = 2019-10-22T20:23:44
Hostname = cn5
Last slurmctld msg time = NONE
Slurmd PID = 967
Slurmd Debug = 3
Slurmd Logfile = /var/log/slurm-llnl/slurmd.log
Version = 16.05.9
In other words, it has never registered with the controller.
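(If it helps, I assume that running scontrol ping on the node itself would show whether slurmctld answers from the node's point of view; I can attach that output as well:
scontrol ping    # on cn5; should report the primary controller as UP or DOWN
)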
The log at /var/log/slurm-llnl/slurmd.log shows:
[2019-10-23T06:25:06.031] Considering each NUMA node as a socket
[2019-10-23T06:25:06.032] Node configuration differs from hardware: CPUs=32:32(hw) Boards=1:1(hw) SocketsPerBoard=2:4(hw) CoresPerSocket=16:8(hw) ThreadsPerCore=1:1(hw)
[2019-10-23T06:25:06.032] Message aggregation disabled
[2019-10-23T06:25:06.034] Resource spec: Reserved system memory limit not configured for this node
[2019-10-23T06:25:06.056] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2019-10-23T06:25:06.066] error: Unable to register: Zero Bytes were transmitted or received
[2019-10-23T06:25:07.077] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2019-10-23T06:25:07.087] error: Unable to register: Zero Bytes were transmitted or received
[2019-10-23T06:25:08.099] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2019-10-23T06:25:08.109] error: Unable to register: Zero Bytes were transmitted or received
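If it is useful for the diagnosis, I can also post the hardware as slurmd itself detects it, and check that the controller port is reachable from the node (netcat is my assumption here, it may need to be installed first):
slurmd -C                  # on cn5; prints the detected node configuration in slurm.conf format
nc -zv 192.168.8.8 6817    # checks that the slurmctld port on master is open from the node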
Just in case, I am also attaching the slurm.conf file:
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=master
ControlAddr=192.168.8.8
#BackupController=
#BackupAddr=
#
AuthType=auth/none
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/openssl
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/lib/slurm/checkpoint
JobCredentialPrivateKey=/NAS_config/slurm/keys/key
JobCredentialPublicCertificate=/NAS_config/slurm/keys/certificate
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/usr/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=2
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerPlugin=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=cn[1-13] NodeAddr=192.168.3.[1-13] CPUs=32 Sockets=2 CoresPerSocket=16 State=UNKNOWN
PartitionName=main Nodes=cn[1-13] Default=NO MaxTime=INFINITE State=UP
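(A side note: judging by the "Node configuration differs from hardware" message in the log, the NodeName line above does not match the real hardware, which is 4 sockets with 8 cores each rather than 2 with 16. If it has to match exactly, I assume the line would look roughly like this, although I am not sure this is related to the registration failure:
NodeName=cn[1-13] NodeAddr=192.168.3.[1-13] CPUs=32 Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN
)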
So the question is: is this error caused by the mismatched Slurm versions on the control node and the compute nodes, meaning I would have to upgrade the OS on the control node as well in order to update Slurm there, or is it some configuration problem, and is it actually possible to run Slurm across different OS versions?