99-raid-check hangs Fedora

I can't figure out what has happened. About once a week the server has started to hang: it still answers ping, but it is impossible to connect to it over SSH or by any other means, and only a hard reboot with the reset button brings it back. By trial and error I found that the hang is caused by the 99-raid-check script, which apparently checks the RAID arrays.

Here are the contents of that script:

#!/bin/bash
#
# This script reads its configuration from /etc/sysconfig/raid-check
# Please use that file to enable/disable this script or to set the
# type of check you wish performed.

[ -f /etc/sysconfig/raid-check ] || exit 0
. /etc/sysconfig/raid-check

[ "$ENABLED" != "yes" ] && exit 0

case "$CHECK" in
    check) ;;
    repair) ;;
    *) exit 0;;
esac

active_list=`grep "^md.*: active" /proc/mdstat | cut -f 1 -d ' '`
[ -z "$active_list" ] && exit 0

dev_list=""
check_list=""
devnum=0
for dev in $active_list; do
    echo $SKIP_DEVS | grep -w $dev >/dev/null 2>&1 && continue
    if [ -f /sys/block/$dev/md/sync_action ]; then
	# Only perform the checks on idle, healthy arrays, but delay
	# actually writing the check field until the next loop so we
	# don't switch currently idle arrays to active, which happens
	# when two or more arrays are on the same physical disk
	array_state=`cat /sys/block/$dev/md/array_state`
	sync_action=`cat /sys/block/$dev/md/sync_action`
	if [ "$array_state" = clean -a "$sync_action" = idle ]; then
	    ck=""
	    echo $REPAIR_DEVS | grep -w $dev >/dev/null 2>&1 && ck="repair"
	    echo $CHECK_DEVS | grep -w $dev >/dev/null 2>&1 && ck="check"
	    [ -z "$ck" ] && ck=$CHECK
	    dev_list="$dev_list $dev"
	    check[$devnum]=$ck
	    let devnum++
	    [ "$ck" = "check" ] && check_list="$check_list $dev"
	fi
    fi
done
[ -z "$dev_list" ] && exit 0

devnum=0
for dev in $dev_list; do
    echo "${check[$devnum]}" > /sys/block/$dev/md/sync_action
    let devnum++
done
[ -z "$check_list" ] && exit 0

checking=1
while [ $checking -ne 0 ]
do
	sleep 60
	checking=0
	for dev in $check_list; do
		sync_action=`cat /sys/block/$dev/md/sync_action`
		if [ "$sync_action" != "idle" ]; then
			checking=1
		fi
	done
done
for dev in $check_list; do
	mismatch_cnt=`cat /sys/block/$dev/md/mismatch_cnt`
	if [ "$mismatch_cnt" -ne 0 ]; then
		echo "WARNING: mismatch_cnt is not 0 on /dev/$dev"
	fi
done
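
Stripped of the bookkeeping, the script does two things per array: it writes the action to `sync_action`, then polls that file until it reads `idle` again. A minimal sketch of the same mechanism for a single array follows; running the arrays one at a time this way might show which one triggers the hang. The function name `check_one`, the optional sysfs root argument, and the poll interval argument are assumptions made for this illustration and are not part of the Fedora script.

```shell
#!/bin/bash
# Sketch: start a check on one md array and wait until the kernel reports idle.
# $1 = array name as in /proc/mdstat (e.g. md0)
# $2 = sysfs root (default /sys); a parameter only so the function can be
#      tried against a scratch directory first
# $3 = poll interval in seconds (default 60, as in the cron script)
check_one() {
    local dev="$1" sysfs="${2:-/sys}" poll="${3:-60}"
    local md="$sysfs/block/$dev/md"

    [ -f "$md/sync_action" ] || { echo "no such array: $dev" >&2; return 1; }

    # Same write the cron script performs in its second loop
    echo check > "$md/sync_action"

    # Poll until the check has finished
    while [ "$(cat "$md/sync_action")" != "idle" ]; do
        sleep "$poll"
    done

    echo "$dev: mismatch_cnt=$(cat "$md/mismatch_cnt")"
}
```

Run as root, `check_one md0` would check only md0. According to the kernel md documentation, writing `idle` to the same `sync_action` file asks the kernel to stop a running check.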

Here are the contents of /etc/sysconfig/raid-check, the configuration file that the script reads:

#!/bin/bash
#
# Configuration file for /etc/cron.weekly/raid-check
#
# options:
#	ENABLED - must be yes in order for the raid check to proceed
#	CHECK - can be either check or repair depending on the type of
#		operation the user desires.  A check operation will scan
#		the drives looking for bad sectors and automatically
#		repairing only bad sectors.  If it finds good sectors that
#		contain bad data (meaning that the data in a sector does
#		not agree with what the data from another disk indicates
#		the data should be, for example the parity block + the other
#		data blocks would cause us to think that this data block
#		is incorrect), then it does nothing but increments the
#		counter in the file /sys/block/$dev/md/mismatch_cnt.
#		This allows the sysadmin to inspect the data in the sector
#		and the data that would be produced by rebuilding the
#		sector from redundant information and pick the correct
#		data to keep.  The repair option does the same thing, but
#		when it encounters a mismatch in the data, it automatically
#		updates the data to be consistent.  However, since we really
#		don't know whether it's the parity or the data block that's
#		correct (or which data block in the case of raid1), it's
#		luck of the draw whether or not the user gets the right
#		data instead of the bad data.  This option is the default
#		option for devices not listed in either CHECK_DEVS or
#		REPAIR_DEVS.
#	CHECK_DEVS - a space delimited list of devs that the user specifically
#		wants to run a check operation on.
#	REPAIR_DEVS - a space delimited list of devs that the user
#		specifically wants to run a repair on.
#	SKIP_DEVS - a space delimited list of devs that should be skipped
#
# Note: the raid-check script intentionally runs last in the cron.weekly
# sequence.  This is so we can wait for all the resync operations to complete
# and then check the mismatch_cnt on each array without unduly delaying
# other weekly cron jobs.  If any arrays have a non-0 mismatch_cnt after
# the check completes, we echo a warning to stdout which will then be emailed
# to the admin as long as mails from cron jobs have not been redirected to
# /dev/null.  We do not wait for repair operations to complete as the
# md stack will correct any mismatch_cnts automatically.
#
# Note2: you can not use symbolic names for the raid devices, such as
# /dev/md/root.  The names used in this file must match the names seen in
# /proc/mdstat and in /sys/block.

ENABLED=yes
CHECK=check
# To check devs /dev/md0 and /dev/md3, use "md0 md3"
CHECK_DEVS=""
REPAIR_DEVS=""
SKIP_DEVS=""

All of this was set up before my time and ran for several years without any glitches.

# cat /proc/mdstat
Personalities : [raid1] 
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]
      
md5 : active raid1 sdb2[1] sda2[0]
      296961408 blocks [2/2] [UU]
      
md3 : active raid1 sdb3[1] sda3[0]
      102398208 blocks [2/2] [UU]
      
md6 : active raid1 sdb5[1] sda5[0]
      10241280 blocks [2/2] [UU]
      
md4 : active raid1 sdb6[1] sda6[0]
      10241280 blocks [2/2] [UU]
      
md2 : active raid1 sdb8[1]
      6144704 blocks [2/1] [_U]
      
md1 : active raid1 sdb7[1] sda7[0]
      10241280 blocks [2/2] [UU]

What worries me here is that md2 is missing sda8[0] (it shows [2/1] [_U]), but I asked the previous admin and he says this is normal because that array holds swap.
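
The `[2/1] [_U]` status means md2 is configured for two members but only one is active. As a quick cross-check, a sketch like the following lists every array whose status field contains `_`. The function name `degraded_arrays` is made up for this example, and it assumes the classic mdstat layout shown above, where the member status is the last field on the `blocks` line.

```shell
#!/bin/bash
# Sketch: print the names of md arrays that have a missing member.
# $1 = file in /proc/mdstat format (default: /proc/mdstat)
degraded_arrays() {
    local mdstat="${1:-/proc/mdstat}"
    awk '
        /^md[0-9]+ :/ { dev = $1 }          # remember the current array name
        /blocks/ && dev != "" {
            status = $NF                    # last field, e.g. [UU] or [_U]
            if (status ~ /_/) print dev     # "_" marks a missing member
            dev = ""
        }
    ' "$mdstat"
}
```

Called with no arguments, it reads /proc/mdstat; for the listing above it would print only md2.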

I went through the logs and the output of

cat /sys/block/$dev/md/mismatch_cnt

for every array: no errors anywhere and nothing suspicious.

Could you please suggest where else to dig and what to check? So far the only thing I have done is set

ENABLED=no

so that the server stops hanging, but I would like to get to the root of the problem rather than just remove the symptom.
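
A less drastic experiment than `ENABLED=no` might be to leave the check on but exclude one array at a time through `SKIP_DEVS`, which the script matches against the names from /proc/mdstat with `grep -w`. The fragment below skips md2 first, on the unverified assumption that the degraded array is the most likely suspect; that choice is a guess, not a diagnosis.

```shell
# /etc/sysconfig/raid-check (sketch)
ENABLED=yes
CHECK=check
CHECK_DEVS=""
REPAIR_DEVS=""
# Hypothetical first candidate: the array running on a single disk.
# If the hang persists, move on to the next array name.
SKIP_DEVS="md2"
```

If the server survives the next weekly run with md2 skipped, that would point at md2; if not, the same fragment can be reused with another array name.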

Is it rebooted by a remote person who can only press one button?

Perhaps your load average keeps growing and you end up with something like a fork bomb. It would be good to see what is active in the system while the problem is happening. Either connect physically, or try to watch the activity in the window between the start of raid-check and the complete loss of access.
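
One way to watch that window without being logged in is to record periodic snapshots to a file and read them after the reboot. The sketch below is only an illustration: the function name `snapshot`, the log path in the usage note, and the choice of fields are assumptions, and if the disks themselves are what stalls, the last entries may never reach the log.

```shell
#!/bin/bash
# Sketch: append one snapshot of system activity to the log file given as $1.
snapshot() {
    local log="$1" f
    {
        date '+%F %T'
        cat /proc/loadavg 2>/dev/null
        # Current md action per array (idle, check, resync, ...), if any exist
        for f in /sys/block/md*/md/sync_action; do
            [ -r "$f" ] && echo "$f: $(cat "$f")"
        done
        # Processes in uninterruptible sleep (state D) often mean stuck I/O
        ps -eo stat,pid,comm 2>/dev/null | awk '$1 ~ /^D/'
        echo "---"
    } >> "$log"
    sync    # flush to disk so the latest snapshot has a chance to survive a reset
}
```

Started shortly before the weekly cron run, a loop such as `while true; do snapshot /var/log/raid-hang.log; sleep 10; done &` would leave a trail showing how the load average and the list of blocked processes changed up to the hang.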

sin_a ★★★★★
(25.04.14 11:01:38 MSK)

In reply to: comment by sin_a 25.04.14 11:01:38 MSK

Yes, unfortunately the server is remote, and in practice the only thing on site is a person who can press the reset button. However, if nothing else is left, I will have to travel there and look at it on the spot.

For now I am gathering more information.

kklkkl
(25.04.14 11:15:38 MSK) thread author

In reply to: comment by kklkkl 25.04.14 11:15:38 MSK

Access probably isn't lost right away. You could try to catch it during that interval.

sin_a ★★★★★
(25.04.14 11:18:24 MSK)
