I can't figure out what's going on. About once a week the server has started hanging: it still answers ping, but I can't connect over SSH or by any other means, and the only cure is a reboot with the reset button. By trial and error I found that what hangs it is the 99-raid-check script, which apparently checks the RAID arrays.
Here are the contents of that script:
#!/bin/bash
#
# This script reads its configuration from /etc/sysconfig/raid-check
# Please use that file to enable/disable this script or to set the
# type of check you wish performed.
[ -f /etc/sysconfig/raid-check ] || exit 0
. /etc/sysconfig/raid-check
[ "$ENABLED" != "yes" ] && exit 0
case "$CHECK" in
    check) ;;
    repair) ;;
    *) exit 0;;
esac
active_list=`grep "^md.*: active" /proc/mdstat | cut -f 1 -d ' '`
[ -z "$active_list" ] && exit 0
dev_list=""
check_list=""
devnum=0
for dev in $active_list; do
    echo $SKIP_DEVS | grep -w $dev >/dev/null 2>&1 && continue
    if [ -f /sys/block/$dev/md/sync_action ]; then
        # Only perform the checks on idle, healthy arrays, but delay
        # actually writing the check field until the next loop so we
        # don't switch currently idle arrays to active, which happens
        # when two or more arrays are on the same physical disk
        array_state=`cat /sys/block/$dev/md/array_state`
        sync_action=`cat /sys/block/$dev/md/sync_action`
        if [ "$array_state" = clean -a "$sync_action" = idle ]; then
            ck=""
            echo $REPAIR_DEVS | grep -w $dev >/dev/null 2>&1 && ck="repair"
            echo $CHECK_DEVS | grep -w $dev >/dev/null 2>&1 && ck="check"
            [ -z "$ck" ] && ck=$CHECK
            dev_list="$dev_list $dev"
            check[$devnum]=$ck
            let devnum++
            [ "$ck" = "check" ] && check_list="$check_list $dev"
        fi
    fi
done
[ -z "$dev_list" ] && exit 0
devnum=0
for dev in $dev_list; do
    echo "${check[$devnum]}" > /sys/block/$dev/md/sync_action
    let devnum++
done
[ -z "$check_list" ] && exit 0
checking=1
while [ $checking -ne 0 ]
do
    sleep 60
    checking=0
    for dev in $check_list; do
        sync_action=`cat /sys/block/$dev/md/sync_action`
        if [ "$sync_action" != "idle" ]; then
            checking=1
        fi
    done
done
for dev in $check_list; do
    mismatch_cnt=`cat /sys/block/$dev/md/mismatch_cnt`
    if [ "$mismatch_cnt" -ne 0 ]; then
        echo "WARNING: mismatch_cnt is not 0 on /dev/$dev"
    fi
done
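Stripped of the configuration handling, what the script does to each array comes down to writing "check" into sync_action and polling that file until it reads "idle" again. A minimal sketch of those two steps for a single array follows; it could be used to run the check on one array at a time and see which one triggers the hang. The function names start_check and wait_idle are hypothetical, the optional sysfs-root argument is only there so the functions can be tried against a copy of the tree, and unlike the script it does not look at array_state.

```shell
#!/bin/bash
# Hypothetical helpers mirroring what raid-check does, for ONE array.
# The real files live under /sys/block/<md>/md/ and need root to write.

# start_check ARRAY [SYSFS_ROOT]
# Request a check, but only if the array is currently idle.
start_check() {
    local dev="$1" root="${2:-/sys/block}"
    [ "$(cat "$root/$dev/md/sync_action")" = idle ] || return 1
    echo check > "$root/$dev/md/sync_action"
}

# wait_idle ARRAY MAX_SECONDS [SYSFS_ROOT]
# Poll once per second until sync_action is idle again.
# Returns 1 if MAX_SECONDS elapse first.
wait_idle() {
    local dev="$1" limit="$2" root="${3:-/sys/block}" waited=0
    while [ "$(cat "$root/$dev/md/sync_action")" != idle ]; do
        [ "$waited" -ge "$limit" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
}
```

Usage would look like `start_check md0 && wait_idle md0 3600`, repeated for each array in turn, with the array name logged to disk before each run so that the last entry survives a hang.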
And here are the contents of its configuration file, /etc/sysconfig/raid-check:
#!/bin/bash
#
# Configuration file for /etc/cron.weekly/raid-check
#
# options:
#   ENABLED - must be yes in order for the raid check to proceed
#   CHECK - can be either check or repair depending on the type of
#       operation the user desires. A check operation will scan
#       the drives looking for bad sectors and automatically
#       repairing only bad sectors. If it finds good sectors that
#       contain bad data (meaning that the data in a sector does
#       not agree with what the data from another disk indicates
#       the data should be, for example the parity block + the other
#       data blocks would cause us to think that this data block
#       is incorrect), then it does nothing but increments the
#       counter in the file /sys/block/$dev/md/mismatch_count.
#       This allows the sysadmin to inspect the data in the sector
#       and the data that would be produced by rebuilding the
#       sector from redundant information and pick the correct
#       data to keep. The repair option does the same thing, but
#       when it encounters a mismatch in the data, it automatically
#       updates the data to be consistent. However, since we really
#       don't know whether it's the parity or the data block that's
#       correct (or which data block in the case of raid1), it's
#       luck of the draw whether or not the user gets the right
#       data instead of the bad data. This option is the default
#       option for devices not listed in either CHECK_DEVS or
#       REPAIR_DEVS.
#   CHECK_DEVS - a space delimited list of devs that the user specifically
#       wants to run a check operation on.
#   REPAIR_DEVS - a space delimited list of devs that the user
#       specifically wants to run a repair on.
#   SKIP_DEVS - a space delimited list of devs that should be skipped
#
# Note: the raid-check script intentionally runs last in the cron.weekly
# sequence. This is so we can wait for all the resync operations to complete
# and then check the mismatch_count on each array without unduly delaying
# other weekly cron jobs. If any arrays have a non-0 mismatch_count after
# the check completes, we echo a warning to stdout which will then be emailed
# to the admin as long as mails from cron jobs have not been redirected to
# /dev/null. We do not wait for repair operations to complete as the
# md stack will correct any mismatch_cnts automatically.
#
# Note2: you can not use symbolic names for the raid devices, such as
# /dev/md/root. The names used in this file must match the names seen in
# /proc/mdstat and in /sys/block.
ENABLED=yes
CHECK=check
# To check devs /dev/md0 and /dev/md3, use "md0 md3"
CHECK_DEVS=""
REPAIR_DEVS=""
SKIP_DEVS=""
All of this was set up before my time and ran for several years without any glitches.
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]
md5 : active raid1 sdb2[1] sda2[0]
      296961408 blocks [2/2] [UU]
md3 : active raid1 sdb3[1] sda3[0]
      102398208 blocks [2/2] [UU]
md6 : active raid1 sdb5[1] sda5[0]
      10241280 blocks [2/2] [UU]
md4 : active raid1 sdb6[1] sda6[0]
      10241280 blocks [2/2] [UU]
md2 : active raid1 sdb8[1]
      6144704 blocks [2/1] [_U]
md1 : active raid1 sdb7[1] sda7[0]
      10241280 blocks [2/2] [UU]
What worries me here is that sda8[0] is missing, so md2 runs on one disk only ([2/1] [_U]). I asked the previous admin about it, and he says this is normal because that array holds swap.
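The [_U] marker can also be looked for mechanically. A minimal sketch, assuming the /proc/mdstat layout shown above (array header line followed by a status line ending in [UU], [_U] and so on); degraded_arrays is a hypothetical helper name, and the optional file argument only exists so it can be run against a saved copy of the output.

```shell
#!/bin/bash
# degraded_arrays [FILE]
# Print the name of every md array whose member map contains "_",
# i.e. an array with a missing or failed member.
# FILE defaults to /proc/mdstat.
degraded_arrays() {
    local mdstat="${1:-/proc/mdstat}"
    awk '
        /^md[0-9]+ :/ { name = $1 }       # header line: remember the array name
        /\[[U_]+\]/ && name != "" {       # status line belonging to that array
            if ($NF ~ /_/) print name     # "_" in the map means a member is absent
            name = ""
        }
    ' "$mdstat"
}
```

Run against the output above, it prints only md2.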
I went through the logs and every value of
cat /sys/block/$dev/md/mismatch_cnt
and found no errors or anything suspicious anywhere.
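For anyone who wants to see the same values at a glance, here is a small sketch that prints the three sysfs files the script relies on (array_state, sync_action, mismatch_cnt) for every array. md_status is a hypothetical helper name, and the optional directory argument is an assumption added so that it can be pointed at a copy of the sysfs tree.

```shell
#!/bin/bash
# md_status [SYSFS_ROOT]
# One line per array: name, array_state, sync_action, mismatch_cnt.
# A file that cannot be read is shown as "-".
md_status() {
    local root="${1:-/sys/block}" d name
    for d in "$root"/md*/md; do
        [ -d "$d" ] || continue            # skip if the glob matched nothing
        name=$(basename "$(dirname "$d")")
        printf '%s %s %s %s\n' "$name" \
            "$(cat "$d/array_state" 2>/dev/null || echo -)" \
            "$(cat "$d/sync_action" 2>/dev/null || echo -)" \
            "$(cat "$d/mismatch_cnt" 2>/dev/null || echo -)"
    done
}
```

Calling md_status without arguments reads the live values under /sys/block.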
Could you please suggest where else to dig and what to check?
For now the only thing I have done is
, so that it stops hanging, but I would like to get to the bottom of the problem rather than just treat the symptoms.