Продолжение марлезонского балета.
В общем, решено было записать gentoo livecd и прогнать самописным тестом:
#!/bin/bash
while true; do
echo "$(date -R | awk '{print $2,$3,$5}') Test CPU, HDD (dd + urandom, 128M, 256M, 512M, 1G)" >> /mnt/debian/var/log/testserver.log
`dd if=/dev/urandom of=/mnt/debian/tmp/testfile128M.img bs=1k count=131072 2>/dev/null; echo "$(date -R | awk '{print $2,$3,$5}') Test dd 128M finished" >> /mnt/debian/var/log/testserver.log` &
`dd if=/dev/urandom of=/mnt/debian/tmp/testfile256M.img bs=1k count=262144 2>/dev/null; echo "$(date -R | awk '{print $2,$3,$5}') Test dd 256M finished" >> /mnt/debian/var/log/testserver.log` &
`dd if=/dev/urandom of=/mnt/debian/tmp/testfile512M.img bs=1k count=524288 2>/dev/null; echo "$(date -R | awk '{print $2,$3,$5}') Test dd 512M finished" >> /mnt/debian/var/log/testserver.log` &
`dd if=/dev/urandom of=/mnt/debian/tmp/testfile1G.img bs=1k count=1048576 2>/dev/null; echo "$(date -R | awk '{print $2,$3,$5}') Test dd 1G finished" >> /mnt/debian/var/log/testserver.log`
rm -rf /mnt/debian/tmp/testfile*
echo "$(date -R | awk '{print $2,$3,$5}') Test CPU, HDD (dd, urandom) finished" >> /mnt/debian/var/log/testserver.log
sleep 60
echo "$(date -R | awk '{print $2,$3,$5}') Test memory (copy in tmpfs)" >> /mnt/debian/var/log/testserver.log
cp /mnt/debian/root/iso-* /mem/
echo "$(date -R | awk '{print $2,$3,$5}') Test memory: copying files finished" >> /mnt/debian/var/log/testserver.log
sleep 120
echo "$(date -R | awk '{print $2,$3,$5}') Test memory: finished. Copying files not deleted" >> /mnt/debian/var/log/testserver.log
echo "$(date -R | awk '{print $2,$3,$5}') Test HDD (dd + /dev/zero, 10G)" >> /mnt/debian/var/log/testserver.log
dd if=/dev/zero of=/mnt/debian/tmp/testfile10G.img bs=1k count=10485760 2>/dev/null
echo "$(date -R | awk '{print $2,$3,$5}') Test HDD (dd + /dev/zero, 10G), finished" >> /mnt/debian/var/log/testserver.log
rm -rf /mnt/debian/tmp/testfile*
rm -rf /mem/iso-*
sleep 120
done
До этого я трижды прогонял этот тест (на Debiane, который был установлен на этом рейде), и дважды сервак вис после «Test dd 512M finished».
Теперь, я получил возможность срисовать логи, итак, грузимся, монтируемся:
...
Jan 23 12:26:26 livecd kernel: EXT4-fs (cciss!c0d0p1): mounted filesystem with ordered data mode. Opts: (null)
Jan 23 12:27:14 livecd dhcpcd[24180]: enp3s1f1: broadcasting for a lease
Jan 23 12:27:15 livecd dhcpcd[24180]: enp3s1f0: broadcasting for a lease
...
Спустя 2,5 суток, упс!
...
Jan 25 20:24:47 livecd kernel: INFO: task jbd2/cciss!c0d0:24327 blocked for more than 120 seconds.
Jan 25 20:24:47 livecd kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 25 20:24:47 livecd kernel: jbd2/cciss!c0d0 D f6788000 0 24327 2 0x00000000
Jan 25 20:24:47 livecd kernel: f3c23b10 00000046 1b54d000 f6788000 f6c59770 0a646588 00024a73 c155ef80
Jan 25 20:24:47 livecd kernel: c155ef80 c14d2140 f6febcd8 c1045fb5 f6febce8 f6febce8 00000202 00000202
Jan 25 20:24:47 livecd kernel: c10591d9 3b4d88fb 00000007 0a6476d8 00024a73 f6febcc8 0266a950 c14d2140
Jan 25 20:24:47 livecd kernel: Call Trace:
Jan 25 20:24:47 livecd kernel: [<c1045fb5>] ? ktime_get_ts+0x41/0xc6
Jan 25 20:24:47 livecd kernel: [<c10591d9>] ? delayacct_end+0x8d/0x95
Jan 25 20:24:47 livecd kernel: [<c10a24e6>] ? __wait_on_buffer+0x1a/0x1a
Jan 25 20:24:47 livecd kernel: [<c1045fb5>] ? ktime_get_ts+0x41/0xc6
Jan 25 20:24:47 livecd kernel: [<c10a24e6>] ? __wait_on_buffer+0x1a/0x1a
Jan 25 20:24:47 livecd kernel: [<c1398533>] ? io_schedule+0x6e/0xa2
Jan 25 20:24:47 livecd kernel: [<c10a24eb>] ? sleep_on_buffer+0x5/0x8
Jan 25 20:24:47 livecd kernel: [<c1397150>] ? __wait_on_bit+0x2d/0x58
Jan 25 20:24:47 livecd kernel: [<c13971d6>] ? out_of_line_wait_on_bit+0x5b/0x63
Jan 25 20:24:47 livecd kernel: [<c10a24e6>] ? __wait_on_buffer+0x1a/0x1a
Jan 25 20:24:47 livecd kernel: [<c1036f3a>] ? autoremove_wake_function+0x29/0x29
Jan 25 20:24:47 livecd kernel: [<c10a24e4>] ? __wait_on_buffer+0x18/0x1a
Jan 25 20:24:47 livecd kernel: [<c113739a>] ? jbd2_journal_commit_transaction+0xaa5/0x1077
Jan 25 20:24:47 livecd kernel: [<c113971c>] ? kjournald2+0xa3/0x1ac
Jan 25 20:24:47 livecd kernel: [<c113971c>] ? kjournald2+0xa3/0x1ac
Jan 25 20:24:47 livecd kernel: [<c1036f11>] ? abort_exclusive_wait+0x63/0x63
Jan 25 20:24:47 livecd kernel: [<c1139679>] ? commit_timeout+0x5/0x5
Jan 25 20:24:47 livecd kernel: [<c103686b>] ? kthread+0x8d/0x92
Jan 25 20:24:47 livecd kernel: [<c13997bb>] ? ret_from_kernel_thread+0x1b/0x30
Jan 25 20:24:47 livecd kernel: [<c10367de>] ? __kthread_parkme+0x50/0x50
...
Вот лог при попытке писать процессом test.sh:
...
Jan 25 20:26:47 livecd kernel: INFO: task test.sh:28913 blocked for more than 120 seconds.
Jan 25 20:26:47 livecd kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 25 20:26:47 livecd kernel: test.sh D 00000001 0 28913 24246 0x00000000
Jan 25 20:26:47 livecd kernel: f654bb10 00000086 ffffffff 00000001 f6c59030 c78d2438 c3ef4358 c155ef80
Jan 25 20:26:47 livecd kernel: c155ef80 f6a0b6a0 db44c588 db44c5c0 00001000 f533f780 f685f320 f31e9000
Jan 25 20:26:47 livecd kernel: c10a4620 00001000 00000001 f7548760 c155ef80 00000246 00000246 c1036e16
Jan 25 20:26:47 livecd kernel: Call Trace:
Jan 25 20:26:47 livecd kernel: [<c10a4620>] ? __getblk+0x23/0x283
Jan 25 20:26:47 livecd kernel: [<c1036e16>] ? prepare_to_wait+0x57/0x5f
Jan 25 20:26:47 livecd kernel: [<c11361d4>] ? do_get_write_access+0x1ae/0x346
Jan 25 20:26:47 livecd kernel: [<c1036f3a>] ? autoremove_wake_function+0x29/0x29
Jan 25 20:26:47 livecd kernel: [<c113644f>] ? jbd2_journal_get_write_access+0x18/0x26
Jan 25 20:26:47 livecd kernel: [<c111eb42>] ? __ext4_journal_get_write_access+0x3b/0x4e
Jan 25 20:26:47 livecd kernel: [<c11086a0>] ? ext4_reserve_inode_write+0x2c/0x5f
Jan 25 20:26:47 livecd kernel: [<c11086f2>] ? ext4_mark_inode_dirty+0x1f/0x14f
Jan 25 20:26:47 livecd kernel: [<c110aeba>] ? ext4_dirty_inode+0x2c/0x42
Jan 25 20:26:47 livecd kernel: [<c109df06>] ? __mark_inode_dirty+0x1e/0x185
Jan 25 20:26:47 livecd kernel: [<c1095d8b>] ? update_time+0x81/0x8a
Jan 25 20:26:47 livecd kernel: [<c1095e05>] ? file_update_time+0x71/0x8a
Jan 25 20:26:47 livecd kernel: [<c1063364>] ? __generic_file_aio_write+0x239/0x3a0
Jan 25 20:26:47 livecd kernel: [<c106351f>] ? generic_file_aio_write+0x54/0x9a
Jan 25 20:26:47 livecd kernel: [<c1103089>] ? ext4_file_write+0x3ec/0x413
Jan 25 20:26:47 livecd kernel: [<c108d348>] ? link_path_walk+0x57/0x675
Jan 25 20:26:47 livecd kernel: [<c1075289>] ? do_wp_page+0x2b6/0x641
Jan 25 20:26:47 livecd kernel: [<c108f884>] ? path_openat.isra.57+0x9c/0x33a
Jan 25 20:26:47 livecd kernel: [<c1076c32>] ? handle_pte_fault+0x5f6/0x633
Jan 25 20:26:47 livecd kernel: [<c108574d>] ? do_sync_write+0x61/0x88
Jan 25 20:26:47 livecd kernel: [<c10856ec>] ? do_sync_readv_writev+0x8d/0x8d
Jan 25 20:26:47 livecd kernel: [<c1085fa6>] ? vfs_write+0x9d/0x13d
Jan 25 20:26:47 livecd kernel: [<c10862ad>] ? SyS_write+0x3f/0x6c
Jan 25 20:26:47 livecd kernel: [<c1398ff8>] ? syscall_call+0x7/0xb
...
Итого: баг проявляется на разных винтах, т.е. это не проблема бэдов. Получается гайка контроллеру/материнке?
Решено обновлением прошивки