Good afternoon! I have a server with 4 disks. They are paired into two software RAID1 arrays: /dev/md0 (/dev/sda7 + /dev/sdb3) and /dev/md1 (/dev/sdc1 + /dev/sdd1). These arrays are combined into a single LVM volume group, which holds one logical volume with an XFS file system that takes up all the space.
Recently one of the disks (/dev/sdd1) failed and was no longer detected by the system. I bought an identical disk and installed it in the server.
I copied the partition table from the working disk to the new one with the command:
sfdisk -d /dev/sdc | sfdisk /dev/sdd
I added the new disk to /dev/md1 with the command:
mdadm --manage /dev/md1 --add /dev/sdd1
After a reboot the rebuild starts, but it aborts at 2.3% and /dev/sdd1 goes back to SPARE status.
dmesg shows a read error on /dev/sdc1.
SMART output for /dev/sdc1:
smartctl version 5.36 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: MB2000EAMZF
Serial Number: 9WM0ETAB
Firmware Version: HPG1
User Capacity: 2,000,398,934,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Not recognized. Minor revision code: 0x28
Local Time is: Wed Oct 11 11:29:24 2017 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 609) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 3) minutes.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 070 056 044 Pre-fail Always - 25835348947
3 Spin_Up_Time 0x0003 093 092 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 76
5 Reallocated_Sector_Ct 0x0033 096 096 036 Pre-fail Always - 168
7 Seek_Error_Rate 0x000f 085 060 030 Pre-fail Always - 369643805
9 Power_On_Hours 0x0032 041 041 000 Old_age Always - 51821
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 78
180 Unknown_Attribute 0x003b 100 100 000 Pre-fail Always - 420267169
184 Unknown_Attribute 0x0032 100 100 003 Old_age Always - 0
187 Unknown_Attribute 0x0032 085 085 000 Old_age Always - 15
188 Unknown_Attribute 0x0032 100 097 000 Old_age Always - 26
189 Unknown_Attribute 0x003a 099 099 000 Old_age Always - 1
190 Unknown_Attribute 0x0022 061 055 045 Old_age Always - 690094119
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 38
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 78
194 Temperature_Celsius 0x0022 039 045 000 Old_age Always - 39 (Lifetime Min/Max 0/20)
195 Hardware_ECC_Recovered 0x001a 048 018 000 Old_age Always - 65545171
196 Reallocated_Event_Count 0x0033 096 096 036 Pre-fail Always - 168
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 174 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 174 occurred at disk power-on lifetime: 51820 hours (2159 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 65 70 20 05
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 80 01 70 20 45 00 8d+00:50:15.291 [RESERVED FOR SERIAL ATA]
ef 10 02 00 00 00 a0 00 8d+00:50:15.291 SET FEATURES [Reserved for Serial ATA]
ec 00 00 00 00 00 a0 00 8d+00:50:15.290 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 8d+00:50:15.290 SET FEATURES [Set transfer mode]
ef 10 02 00 00 00 a0 00 8d+00:50:15.290 SET FEATURES [Reserved for Serial ATA]
Error 173 occurred at disk power-on lifetime: 51820 hours (2159 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 65 70 20 05
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 80 01 70 20 45 00 8d+00:50:12.635 [RESERVED FOR SERIAL ATA]
ef 10 02 00 00 00 a0 00 8d+00:50:12.635 SET FEATURES [Reserved for Serial ATA]
ec 00 00 00 00 00 a0 00 8d+00:50:12.634 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 8d+00:50:12.634 SET FEATURES [Set transfer mode]
ef 10 02 00 00 00 a0 00 8d+00:50:12.633 SET FEATURES [Reserved for Serial ATA]
Error 172 occurred at disk power-on lifetime: 51820 hours (2159 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 65 70 20 05
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 80 01 70 20 45 00 8d+00:50:09.971 [RESERVED FOR SERIAL ATA]
60 00 80 01 71 20 45 00 8d+00:50:09.970 [RESERVED FOR SERIAL ATA]
ef 10 02 00 00 00 a0 00 8d+00:50:09.970 SET FEATURES [Reserved for Serial ATA]
ec 00 00 00 00 00 a0 00 8d+00:50:09.969 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 8d+00:50:09.969 SET FEATURES [Set transfer mode]
Error 171 occurred at disk power-on lifetime: 51820 hours (2159 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 65 70 20 05
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 80 01 71 20 45 00 8d+00:50:07.314 [RESERVED FOR SERIAL ATA]
60 00 80 01 70 20 45 00 8d+00:50:07.314 [RESERVED FOR SERIAL ATA]
ef 10 02 00 00 00 a0 00 8d+00:50:07.314 SET FEATURES [Reserved for Serial ATA]
ec 00 00 00 00 00 a0 00 8d+00:50:07.313 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 8d+00:50:07.313 SET FEATURES [Set transfer mode]
Error 170 occurred at disk power-on lifetime: 51820 hours (2159 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 65 70 20 05
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 80 01 70 20 45 00 8d+00:50:04.631 [RESERVED FOR SERIAL ATA]
60 00 80 01 71 20 45 00 8d+00:50:04.630 [RESERVED FOR SERIAL ATA]
60 00 80 81 71 20 45 00 8d+00:50:04.630 [RESERVED FOR SERIAL ATA]
60 00 80 01 72 20 45 00 8d+00:50:04.630 [RESERVED FOR SERIAL ATA]
60 00 80 01 7c 20 45 00 8d+00:50:04.629 [RESERVED FOR SERIAL ATA]
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
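As a rough sketch, the sector where the read fails can be reconstructed from the registers in the error log above. All five logged errors show the same values (SN=65, CL=70, CH=20, DH=05). The snippet assumes the classic 28-bit layout, where the LBA is the low nibble of DH followed by CH, CL and SN; that layout is my assumption and not something the log states explicitly.

```shell
# Registers copied from "Error 174" in the SMART error log above.
sn=0x65; cl=0x70; ch=0x20; dh=0x05

# Assumption: 28-bit LBA = (DH & 0x0f) : CH : CL : SN
lba=$(( (($dh & 0x0f) << 24) | ($ch << 16) | ($cl << 8) | $sn ))

echo "failing LBA (whole-disk sector number): $lba"
```

The number is relative to the start of the disk, not to the /dev/sdc1 partition, so the partition's start offset would still have to be subtracted before comparing it with anything md reports.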
For completeness, here is the state of /dev/md1:
[root@e5 mapper]# mdadm -D /dev/md1
/dev/md1:
Version : 00.90.03
Creation Time : Thu Oct 27 14:33:11 2011
Raid Level : raid1
Array Size : 1863013184 (1776.71 GiB 1907.73 GB)
Used Dev Size : 1863013184 (1776.71 GiB 1907.73 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 1
Persistence : Superblock is persistent
Update Time : Wed Oct 11 11:38:49 2017
State : clean, degraded
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
UUID : 2bafe702:89d5e11f:da4519c3:ba339ffc
Events : 0.20868067
Number Major Minor RaidDevice State
0 8 33 0 active sync /dev/sdc1
2 8 49 1 spare rebuilding /dev/sdd1
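To check whether the rebuild stops at the same place each time, the 2.3% figure can be converted into an approximate sector offset using the Array Size above. This is only an estimate: it assumes 512-byte sectors, takes Array Size as KiB, and 2.3% is itself a rounded value.

```shell
array_kib=1863013184   # "Array Size" from mdadm -D, taken as KiB
pct_x10=23             # 2.3 % written in tenths of a percent

offset_kib=$(( array_kib * pct_x10 / 1000 ))
offset_sectors=$(( offset_kib * 2 ))   # assumption: 512-byte sectors

echo "rebuild stops roughly $offset_sectors sectors into /dev/sdc1"
```

The result is relative to the start of the array member, so it can only be compared with the addresses in the SMART error log of the disk after allowing for the partition's start offset.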
Could you advise how to bring the /dev/md1 array back to a working state?