LINUX.ORG.RU

CentOS server has started crashing

For several days in a row, about once a day, a server running CentOS 7 has been hanging hard. It had run stably for a year. The dmesg log shows:

[    0.984545] NET: Registered protocol family 38
[    0.984557] Key type asymmetric registered
[    0.984561] Asymmetric key parser 'x509' registered
[    0.984608] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 248)
[    0.984678] io scheduler noop registered
[    0.984684] io scheduler deadline registered (default)
[    0.984727] io scheduler cfq registered
[    0.984731] io scheduler mq-deadline registered
[    0.984735] io scheduler kyber registered
[    0.985316] pci_hotplug: PCI Hot Plug PCI Core version: 0.5
[    0.985327] pciehp: PCI Express Hot Plug Controller Driver version: 0.4
[    0.985423] intel_idle: MWAIT substates: 0x1120
[    0.985425] intel_idle: v0.4.1 model 0x1A
[    0.985571] intel_idle: lapic_timer_reliable_states 0x2
[    0.985671] input: Power Button as /devices/LNXSYSTM:00/device:00/PNP0C0C:00/input/input0
[    0.985675] ACPI: Power Button [PWRB]
[    0.985723] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input1
[    0.985726] ACPI: Power Button [PWRF]
[    0.985816] ACPI: Requesting acpi_cpufreq
[    0.986465] ERST: Failed to get Error Log Address Range.
[    0.991168] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.
[    0.991183] GHES: Failed to enable APEI firmware first mode.
[    0.991260] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[    1.011793] 00:03: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
[    1.032351] 00:04: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
[    1.032828] Non-volatile memory driver v1.3
[    1.032859] Linux agpgart interface v0.103
[    1.033012] crash memory driver: version 1.1
[    1.033173] rdac: device handler registered
[    1.033209] hp_sw: device handler registered
[    1.033212] emc: device handler registered
[    1.033267] alua: device handler registered
[    1.033316] libphy: Fixed MDIO Bus: probed
[    1.033373] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[    1.033378] ehci-pci: EHCI PCI platform driver
[    1.033603] ehci-pci 0000:00:1a.7: EHCI Host Controller
[    1.033672] ehci-pci 0000:00:1a.7: new USB bus registered, assigned bus number 1
[    1.033685] ehci-pci 0000:00:1a.7: debug port 1
[    1.037582] ehci-pci 0000:00:1a.7: cache line size of 256 is not supported
[    1.037597] ehci-pci 0000:00:1a.7: irq 18, io mem 0xfbeda000
[    1.043141] ehci-pci 0000:00:1a.7: USB 2.0 started, EHCI 1.00
[    1.043203] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002
[    1.043206] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    1.043208] usb usb1: Product: EHCI Host Controller
[    1.043211] usb usb1: Manufacturer: Linux 3.10.0-862.14.4.el7.x86_64 ehci_hcd
[    1.043213] usb usb1: SerialNumber: 0000:00:1a.7
[    1.043340] hub 1-0:1.0: USB hub found
[    1.043348] hub 1-0:1.0: 6 ports detected
[    1.043667] ehci-pci 0000:00:1d.7: EHCI Host Controller
[    1.043721] ehci-pci 0000:00:1d.7: new USB bus registered, assigned bus number 2
[    1.043733] ehci-pci 0000:00:1d.7: debug port 1
[    1.047621] ehci-pci 0000:00:1d.7: cache line size of 256 is not supported
[    1.047635] ehci-pci 0000:00:1d.7: irq 23, io mem 0xfbed8000
[    1.053141] ehci-pci 0000:00:1d.7: USB 2.0 started, EHCI 1.00
[    1.053177] usb usb2: New USB device found, idVendor=1d6b, idProduct=0002
[    1.053180] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
There are no other errors.


This is the messages log around the time of the hang:

Nov  7 09:44:01 host002 systemd: Started Session 7338 of user root.
Nov  7 09:44:01 host002 systemd: Starting Session 7338 of user root.
Nov  7 10:21:39 host002 journal: Runtime journal is using 8.0M (max allowed 2.0G, trying to leave 3.0G free of 20.5G available → current limit 2.0G).
Nov  7 10:21:39 host002 kernel: microcode: microcode updated early to revision 0x1d, date = 2018-05-11
Nov  7 10:21:39 host002 kernel: Initializing cgroup subsys cpuset
Nov  7 10:21:39 host002 kernel: Initializing cgroup subsys cpu
Nov  7 10:21:39 host002 kernel: Initializing cgroup subsys cpuacct
Nov  7 10:21:39 host002 kernel: Linux version 3.10.0-862.14.4.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) ) #1 SMP Wed Sep 26 15:12:11 UTC 2018
Nov  7 10:21:39 host002 kernel: Command line: BOOT_IMAGE=/vmlinuz-3.10.0-862.14.4.el7.x86_64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet LANG=ru_R$
Nov  7 10:21:39 host002 kernel: e820: BIOS-provided physical RAM map:
Nov  7 10:21:39 host002 kernel: BIOS-e820: [mem 0x0000000000000000-0x0000000000099fff] usable
Nov  7 10:21:39 host002 kernel: BIOS-e820: [mem 0x000000000009a000-0x000000000009ffff] reserved
Nov  7 10:21:39 host002 kernel: BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Nov  7 10:21:39 host002 kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bf77ffff] usable
Nov  7 10:21:39 host002 kernel: BIOS-e820: [mem 0x00000000bf78e000-0x00000000bf78ffff] type 9
Nov  7 10:21:39 host002 kernel: BIOS-e820: [mem 0x00000000bf790000-0x00000000bf79dfff] ACPI data
Nov  7 10:21:39 host002 kernel: BIOS-e820: [mem 0x00000000bf79e000-0x00000000bf7cffff] ACPI NVS
Nov  7 10:21:39 host002 kernel: BIOS-e820: [mem 0x00000000bf7d0000-0x00000000bf7dffff] reserved
Nov  7 10:21:39 host002 kernel: BIOS-e820: [mem 0x00000000bf7ec000-0x00000000bfffffff] reserved
Nov  7 10:21:39 host002 kernel: BIOS-e820: [mem 0x00000000e0000000-0x00000000efffffff] reserved
Nov  7 10:21:39 host002 kernel: BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
Nov  7 10:21:39 host002 kernel: BIOS-e820: [mem 0x00000000ffc00000-0x00000000ffffffff] reserved
Nov  7 10:21:39 host002 kernel: BIOS-e820: [mem 0x0000000100000000-0x0000000abfffffff] usable
Nov  7 10:21:39 host002 kernel: NX (Execute Disable) protection: active
Nov  7 10:21:39 host002 kernel: SMBIOS 2.6 present.
Nov  7 10:21:39 host002 kernel: e820: last_pfn = 0xac0000 max_arch_pfn = 0x400000000
Nov  7 10:21:39 host002 kernel: PAT configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- UC
Nov  7 10:21:39 host002 kernel: total RAM covered: 43000M

iliaxxx
() topic author

If you haven't installed any updates, look at the hardware: check the temperatures, run memtest, and check the disks.

Deleted
()
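As a rough illustration of the first check, here is a minimal Python sketch that reads temperatures from the kernel's hwmon sysfs interface, where `temp*_input` files hold values in millidegrees Celsius. The function name `read_temperatures` is made up for this example, and it assumes a sensor driver (for instance coretemp) is loaded; on a machine without one it simply returns an empty list.

```python
from pathlib import Path


def read_temperatures(base="/sys/class/hwmon"):
    """Return (chip, sensor, degrees_celsius) tuples found under `base`."""
    readings = []
    for sensor in sorted(Path(base).glob("hwmon*/temp*_input")):
        chip_file = sensor.parent / "name"
        # Fall back to the directory name if the chip exposes no "name" file.
        chip = chip_file.read_text().strip() if chip_file.exists() else sensor.parent.name
        try:
            millidegrees = int(sensor.read_text().strip())
        except (OSError, ValueError):
            continue  # some sensors are listed but cannot be read
        readings.append((chip, sensor.name, millidegrees / 1000.0))
    return readings
```

Calling `read_temperatures()` periodically and logging the result would show whether the machine runs hot before a hang. It does not replace memtest or a disk check.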

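палать ["to blaze", the thread title's typo for "падать", "to crash"]

Горіла сосна, палала. ["The pine was burning, blazing", a line from a Ukrainian folk song]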

post-factum ★★★★★
()

Use the process of elimination. Has it always behaved like this on this hardware? What changed?

targitaj ★★★★★
()
In reply to: comment by Deleted

Well, a host that hangs hard usually doesn't get a chance to write anything to the logs, precisely because it has hung. I've seen that more than once.

targitaj ★★★★★
()
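The messages excerpt in the opening post fits this: the last entry before the hang is at 09:44:01, and the next one is the kernel booting at 10:21:39, with nothing in between. A minimal Python sketch for locating such silent gaps is below. The function name `find_gaps`, the 10-minute threshold, and the year 2018 are assumptions made for illustration (classic syslog timestamps carry no year), and month names are parsed with the current locale, so an English or C locale is assumed.

```python
from datetime import datetime, timedelta


def find_gaps(lines, min_gap=timedelta(minutes=10), year=2018):
    """Yield (previous_time, next_time) for consecutive log lines
    whose timestamps are more than `min_gap` apart."""
    prev = None
    for line in lines:
        try:
            # Classic syslog prefix, e.g. "Nov  7 09:44:01" (15 characters).
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line, skip it
        if prev is not None and stamp - prev > min_gap:
            yield prev, stamp
        prev = stamp
```

Run over the file's lines, for example `find_gaps(open("/var/log/messages"))`, each reported pair brackets a period in which the host logged nothing. The entries just before the gap are the place to look for clues, even if they are often unremarkable.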
In reply to: comment by targitaj

Thanks, everyone. The obvious has become evident.

host002 kernel: EDAC MC1: 5 CE error on CPU#1Channel#0_DIMM#0 (channel:0 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
A memory module is dying.

iliaxxx
() topic author
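Lines like the EDAC one above report corrected (CE) memory errors per memory controller and DIMM label. A minimal Python sketch for tallying them from a log is shown here. The regular expression is modelled only on the single line quoted in this thread, so other kernels or EDAC drivers may word the message differently, and the names `EDAC_RE` and `count_edac_errors` are made up for this example.

```python
import re
from collections import Counter

# Modelled on: "EDAC MC1: 5 CE error on CPU#1Channel#0_DIMM#0 (channel:0 ...)"
EDAC_RE = re.compile(
    r"EDAC (?P<mc>MC\d+): (?P<count>\d+) (?P<kind>CE|UE) .*?on (?P<label>\S+)"
)


def count_edac_errors(lines):
    """Sum reported error counts per (controller, DIMM label, kind)."""
    totals = Counter()
    for line in lines:
        match = EDAC_RE.search(line)
        if match:
            key = (match["mc"], match["label"], match["kind"])
            totals[key] += int(match["count"])
    return totals
```

If one label accumulates most of the count, as in this thread, that DIMM is the first candidate for replacement or for swapping slots to confirm the fault follows the module.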