LINUX.ORG.RU

Is the hard drive dying?


Good day, everyone! The server runs CentOS 6. A couple of days ago /dev/sda disappeared. The server is supposed to have a mirror of two hard drives, but honestly I don't know whether it does; I haven't gotten around to checking. Judging by the logs, the drive dropped off the bus and sectors are going bad? I haven't rebooted yet, I'm afraid everything will fall apart))) Is it possible to make a working image or a backup in this state? Here is the dmesg output:

sd 0:0:0:0: [sda] Add. Sense: Recorded entity not found
sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 03 17 12 70 00 00 40 00
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: [sda] killing request
sd 0:0:0:0: rejecting I/O to offline device
Buffer I/O error on device dm-0, logical block 3070616
lost page write due to I/O error on dm-0
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 01 86 4e c0 00 00 08 00
Buffer I/O error on device dm-0, logical block 3068888
lost page write due to I/O error on dm-0
JBD2: Detected IO errors while flushing file data on dm-0-8
Aborting journal on device dm-0-8.
ata1: EH complete
ata1.00: detaching (SCSI 0:0:0:0)
EXT4-fs error (device dm-0): ext4_journal_start_sb:
Buffer I/O error on device dm-0, logical block 6324224
lost page write due to I/O error on dm-0
JBD2: I/O error detected when updating journal superblock for dm-0-8.
EXT4-fs error (device dm-0): ext4_journal_start_sb: Detected aborted journal
EXT4-fs (dm-0): Remounting filesystem read-only
Detected aborted journal
journal commit I/O error
Buffer I/O error on device dm-0, logical block 2621557
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 2621536
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 2621453
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 2662914
EXT4-fs (dm-0): delayed block allocation failed for inode 656425 at logical offset 6809 with max blocks 1 with error -30

This should not happen!!  Data will be lost
EXT4-fs (dm-0): ext4_da_writepages: jbd2_start: 1024 pages, ino 656425; err -30

ata1: exception Emask 0x10 SAct 0x0 SErr 0x4050002 action 0xe frozen
ata1: irq_stat 0x00000040, connection status changed
ata1: SError: { RecovComm PHYRdyChg CommWake DevExch }
ata1: hard resetting link
EXT4-fs error (device dm-0) in ext4_reserve_inode_write: Journal has aborted
EXT4-fs error (device dm-0) in ext4_reserve_inode_write: Journal has aborted
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 2662917
lost page write due to I/O error on dm-0
sd 0:0:0:0: [sda] Synchronizing SCSI cache
ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
ata1: hard resetting link
ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
ata1: limiting SATA link speed to 3.0 Gbps
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
ata1: EH complete
sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:0:0:0: [sda] Stopping disk
sd 0:0:0:0: [sda] START_STOP FAILED
sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Aborting journal on device dm-2-8.
Buffer I/O error on device dm-2, logical block 22577152
lost page write due to I/O error on dm-2
JBD2: I/O error detected when updating journal superblock for dm-2-8.
EXT4-fs error (device sda1): ext4_find_entry: reading directory #2 offset 0
Buffer I/O error on device sda1, logical block 1
lost page write due to I/O error on sda1
Buffer I/O error on device dm-1, logical block 2019312
Buffer I/O error on device dm-1, logical block 2019312
Buffer I/O error on device dm-1, logical block 2019326
Buffer I/O error on device dm-1, logical block 2019326
Buffer I/O error on device dm-1, logical block 2019295
Buffer I/O error on device dm-1, logical block 2019295
Buffer I/O error on device dm-1, logical block 256
Buffer I/O error on device dm-1, logical block 3
Buffer I/O error on device dm-1, logical block 3
Buffer I/O error on device dm-1, logical block 3
__ratelimit: 59 callbacks suppressed
Buffer I/O error on device dm-0, logical block 6324224
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 6324224
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-1, logical block 2019312
Buffer I/O error on device dm-1, logical block 2019326
Buffer I/O error on device dm-1, logical block 2019295
Buffer I/O error on device dm-1, logical block 256
Buffer I/O error on device dm-1, logical block 3
Buffer I/O error on device dm-1, logical block 3
Buffer I/O error on device dm-1, logical block 3
Buffer I/O error on device dm-1, logical block 3
__ratelimit: 106 callbacks suppressed
Buffer I/O error on device dm-1, logical block 2019312
Buffer I/O error on device dm-1, logical block 2019326
Buffer I/O error on device dm-1, logical block 2019295
Buffer I/O error on device dm-1, logical block 256
Buffer I/O error on device dm-1, logical block 3
Buffer I/O error on device dm-1, logical block 3
Buffer I/O error on device dm-1, logical block 3
Buffer I/O error on device dm-1, logical block 3
Buffer I/O error on device dm-1, logical block 7
Buffer I/O error on device dm-1, logical block 7
__ratelimit: 53 callbacks suppressed
Buffer I/O error on device dm-0, logical block 2105589
Buffer I/O error on device dm-0, logical block 2105590
Buffer I/O error on device dm-0, logical block 2105591
Buffer I/O error on device dm-0, logical block 2105592
Buffer I/O error on device dm-0, logical block 2105593
Buffer I/O error on device dm-0, logical block 2105594
Buffer I/O error on device dm-0, logical block 2105595
Buffer I/O error on device dm-0, logical block 2105596
Buffer I/O error on device dm-0, logical block 2105598
Buffer I/O error on device dm-0, logical block 2105600
__ratelimit: 5 callbacks suppressed
Buffer I/O error on device dm-0, logical block 5
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 1057
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 2105589
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 2097259
lost page write due to I/O error on dm-0


Last edited by: Reverse (total edits: 2)
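
The CDB lines in the dmesg output above encode which sectors the failed writes were aimed at. A minimal sketch for decoding them, assuming the standard 10-byte WRITE(10) layout (byte 0 is the opcode, bytes 2-5 the big-endian LBA, bytes 7-8 the transfer length in blocks). The function name decode_cdb10 is made up for illustration:

```shell
#!/bin/sh
# Decode the LBA and transfer length from a 10-byte SCSI CDB given as
# ten space-separated hex bytes, as printed by the kernel in dmesg.
# Assumed layout: byte 0 = opcode, bytes 2-5 = LBA, bytes 7-8 = length.
decode_cdb10() {
    set -- $1                  # split the string into positional fields
    lba=$(( 0x$3$4$5$6 ))      # bytes 2-5, big-endian
    len=$(( 0x$8$9 ))          # bytes 7-8, big-endian
    echo "opcode=0x$1 lba=$lba blocks=$len"
}

# The two Write(10) CDBs from the log above.
decode_cdb10 "2a 00 03 17 12 70 00 00 40 00"
decode_cdb10 "2a 00 01 86 4e c0 00 00 08 00"
```

For the first CDB this gives LBA 51843696 and 64 blocks, which is a 32 KiB write if the blocks are the drive's 512-byte logical sectors.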

Show the S.M.A.R.T. output, and replace the cable while you're at it.

kostik87 ★★★★★

Check SMART on sda, and take it from there...

anonymous

Funny people these days: the drive is toast, and instead of shutting down and sorting out the hardware we get «I haven't rebooted yet, I'm afraid everything will fall apart». Next you'll write that you're hoping it will «sort itself out».

anc ★★★★★

Try diagnosing the drive and backing up whatever is still readable with sys-block/whdd.

haku ★★★★★

A bit off-topic. Funny, a customer's server just went down with disk trouble too: the filesystem is read-only and I'm pulling a full archive over the network :) It's easier in my case, though: that server is virtual, and the host system is not mine to sort out. :)

anc ★★★★★

Anyway, I prepared a replacement system and restored the main services. After a reboot the problem server came up without any issues. Everything works. What was the problem, and which logs should I look at? What should I check?

Reverse (thread author)

SMART output:

[root@R0 /]# smartctl -a /dev/sda
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.32-279.el6.x86_64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD2500HHTZ-04N21V0
Serial Number:    WD-WX81C5246856
LU WWN Device Id: 5 0014ee 30014428c
Firmware Version: 04.06A00
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Dec  2 16:24:58 2015 VLAT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                ( 2400) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  31) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x30bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       5273
  3 Spin_Up_Time            0x0027   178   177   021    Pre-fail  Always       -       2075
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       22
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       10
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       1
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11172
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       22
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       16
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       9
194 Temperature_Celsius     0x0022   113   103   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   197   197   000    Old_age   Always       -       3
197 Current_Pending_Sector  0x0032   195   195   000    Old_age   Always       -       232
198 Offline_Uncorrectable   0x0030   195   195   000    Old_age   Offline      -       214
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   198   198   000    Old_age   Offline      -       175

SMART Error Log Version: 1
ATA Error Count: 16 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 16 occurred at disk power-on lifetime: 11155 hours (464 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 08 00 d5 54 e1  Error: IDNF at LBA = 0x0154d500 = 22336768

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 08 00 d5 54 e1 08   1d+20:28:28.497  WRITE DMA
  ca 00 08 60 11 17 e3 08   1d+20:28:28.025  WRITE DMA
  ca 00 20 40 11 17 e3 08   1d+20:28:28.012  WRITE DMA

Error 15 occurred at disk power-on lifetime: 11155 hours (464 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 08 b0 84 86 e1  Error: IDNF at LBA = 0x018684b0 = 25593008

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 08 b0 84 86 e1 08   1d+20:28:10.316  WRITE DMA
  b0 d0 01 00 4f c2 00 08   1d+20:28:07.188  SMART READ DATA
  b0 d8 00 00 4f c2 00 08   1d+20:28:07.127  SMART ENABLE OPERATIONS
  e5 00 00 00 00 00 00 08   1d+20:28:07.127  CHECK POWER MODE
  ec 00 00 00 00 00 00 08   1d+20:28:07.121  IDENTIFY DEVICE

Error 13 occurred at disk power-on lifetime: 10187 hours (424 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 80 58 62 54 e1  Error: UNC 128 sectors at LBA = 0x01546258 = 22307416

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Reverse (thread author)
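
One thing worth checking after the reboot is whether the sector-health counters in the SMART table are non-zero, regardless of the overall PASSED verdict. A rough sketch, assuming the column layout shown in the smartctl output above (attribute ID in the first field, raw value in the tenth). The helper name smart_bad_sectors is hypothetical:

```shell
#!/bin/sh
# Print SMART attributes 5, 196, 197 and 198 whose RAW_VALUE is non-zero.
# Reads smartctl attribute rows on stdin: field 1 is the ID, field 2 the
# attribute name, field 10 the raw value (layout assumed from the report).
smart_bad_sectors() {
    awk '$1 ~ /^(5|196|197|198)$/ && $10 + 0 > 0 { print $2 "=" $10 }'
}

# Sample rows copied from the report above. On a live system one would
# pipe in the real output instead: smartctl -A /dev/sda | smart_bad_sectors
smart_bad_sectors <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       10
196 Reallocated_Event_Count 0x0032   197   197   000    Old_age   Always       -       3
197 Current_Pending_Sector  0x0032   195   195   000    Old_age   Always       -       232
198 Offline_Uncorrectable   0x0030   195   195   000    Old_age   Offline      -       214
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
EOF
```

On these rows the filter lists all four sector counters (10 reallocated, 3 reallocation events, 232 pending, 214 offline-uncorrectable) and skips UDMA_CRC_Error_Count, which is 0.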

And with the command

 [root@R0 /]# badblocks /dev/sda > /home/badblocks 
a file with 957 lines gets created; as I understand it, all of these are bad blocks =)

And I can't figure out the RAID at all: I was told it is RAID 1, but I see nothing that actually confirms it.

/dev/md0: 0.00KiB (null) 0 devices, 1 spare. Use mdadm --detail for more detail.
[root@R0 /]# mdadm --detail /dev/md0
/dev/md0:
        Version : imsm
     Raid Level : container
  Total Devices : 1

Working Devices : 1


           UUID : bd0bf10b:e17a3b87:c51ca08d:e711cc82
  Member Arrays :

    Number   Major   Minor   RaidDevice

       0       8       16        -        /dev/sdb
Reverse (thread author)
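
For a sense of scale of the badblocks list mentioned above: the file holds one block number per line, so the line count is the number of unreadable blocks. A small sketch that summarises such a file, assuming badblocks ran with its default 1024-byte block size (no -b option is visible in the command) and that the disk uses 512-byte logical sectors as the SMART report states. The function name badblocks_summary and the demo block numbers are made up:

```shell
#!/bin/sh
# Summarise a badblocks output file (one block number per line).
BLOCK_SIZE=1024    # assumption: badblocks default, no -b given
SECTOR_SIZE=512    # logical sector size from the SMART report

badblocks_summary() {
    awk -v bs="$BLOCK_SIZE" -v ss="$SECTOR_SIZE" '
        NF { n++; if (min == "" || $1 + 0 < min) min = $1 + 0; if ($1 + 0 > max) max = $1 + 0 }
        END {
            if (n == 0) { print "no bad blocks listed"; exit }
            printf "bad blocks: %d (%d KiB)\n", n, n * bs / 1024
            # Convert block numbers to the sector range they cover.
            printf "first sector: %d, last sector: %d\n", min * bs / ss, (max + 1) * bs / ss - 1
        }' "$1"
}

# Demo with three made-up block numbers; on the server one would run
# badblocks_summary /home/badblocks instead.
demo=$(mktemp)
printf '%s\n' 100 101 250 > "$demo"
badblocks_summary "$demo"
rm -f "$demo"
```

Under the same assumption, the 957 lines reported above would correspond to roughly 957 KiB of unreadable data.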
In reply to: comment by Reverse

as I understand it

Have you considered using a translator? The drive belongs in the bin, and you should learn to use your head.

Deleted