manul91:
I had something similar (and again, Google turned up nothing). And not only during backups: sometimes the Windows VMs would crash at random.
It was cured by tuning the following kernel sysctl parameters (on Debian, drop them into a file under /etc/sysctl.d/). I am copying them together with the comments, which I hope explain the essence of the problem:
# NEXT SETTINGS ARE TO AVOID SWAPPING OF APPS DUE TO TOO LARGE FILE CACHES IN MEMORY (VM disk backups)
# Reducing these is relevant on high-memory systems (>16G) to reduce the absolute amounts that have to be written to disk at once
# Reducing these as below seems to fix an issue with Win2K3 VMs sometimes spontaneously blue-screening due to "hardware error" (presumably when flushing the caches takes too long)
# At the default value of vfs_cache_pressure=100 the kernel will attempt to reclaim dentries and inodes at a "fair" rate with respect to pagecache and swapcache
# reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will never reclaim
# dentries and inodes due to memory pressure and this can easily lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 causes the kernel
# to prefer to reclaim dentries and inodes.
# From another resource:
# Setting vfs_cache_pressure to a low value makes sense because in most cases the kernel needs to know the directory structure before it can use file
# contents from the cache, and flushing the directory cache too soon will make the file cache next to worthless. Consider going all the way down to 1 with
# this setting if you have lots of files. Setting this to a big value is sensible only if you have just a few big files that are constantly being re-read.
# V: INCREASE IT (it defaults to 100) to AVOID backups of huge VM disk files kicking qemu into swap! See the last remark above.
# vm.vfs_cache_pressure = 300
vm.vfs_cache_pressure = 150
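# (A runtime sketch: the same knob can be read and set on the fly with plain procps
# sysctl, handy for experimenting before persisting anything; changes revert on reboot)
#   sysctl vm.vfs_cache_pressure          # read the current value
#   sysctl -w vm.vfs_cache_pressure=150   # set it until the next reboot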
# dirty_ratio, dirty_background_ratio:
# Lowering them from standard values causes everything to be flushed to disk rather than storing much in RAM. It helps large memory systems, which would normally
# flush a 45G-90G pagecache to disk, causing huge wait times for front-end applications, decreasing overall responsiveness and interactivity.
# https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
# vm.dirty_ratio is the percentage of system memory which, when dirty, makes *the process doing the writes* block and write out dirty pages to the disks itself
# Contains, as a percentage of total available memory that contains free pages and reclaimable pages, the number of pages at which a process which is generating disk writes will itself start writing out dirty data.
# tell the kernel to use up to 5% (96G*0.05 ≈ 4.8G) of the RAM as a cache for writes (defaults to 20%)
# Try decreasing it to see whether the spontaneous shutdowns ("Hardware failure") of the VMs cease
# vm.dirty_ratio is the absolute maximum amount of system memory that can be filled with dirty pages before everything must get committed to disk. When the
# system gets to this point all new I/O blocks until dirty pages have been written to disk. This is often the source of long I/O pauses, but is a
# safeguard against too much data being cached unsafely in memory.
# It is a percentage
# Defaults to 20
# Has to be small for a VM host with slow IO (classic HDDs) so it does not "stall" for long when flushing, which VMs do not like!!
vm.dirty_ratio = 5
# vm.dirty_background_ratio is the percentage of system memory which, when dirty, makes *the system* start writing data out to the disks in the background
# Contains, as a percentage of total available memory that contains free pages and reclaimable pages, the number of pages at which the background kernel flusher threads will start writing out dirty data.
# instruct the kernel to start background writeback once up to 1% of RAM (~1G here) is dirty (the default for dirty_background_ratio is 10)
# Try decreasing it to see whether the spontaneous shutdowns ("Hardware failure") of the VMs cease
# vm.dirty_background_ratio is the percentage of system memory that can be filled with “dirty” pages — memory pages that still need to be written to disk —
# before the pdflush/flush/kdmflush background processes kick in to write it to disk. My example is 10%, so if my virtual server has 32 GB of memory
# that’s 3.2 GB of data that can be sitting in RAM before something is done.
# It is a percentage
# Defaults to 10
# Has to be small for a VM host with slow IO (classic HDDs) so it does not "stall" for long when flushing, which VMs do not like!!
vm.dirty_background_ratio = 1
# if you need a setting even lower than 1% (servers with very large memory), use the absolute-value-based sister
# parameters vm.dirty_background_bytes and vm.dirty_bytes, which can be tuned exactly
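To apply without rebooting (the file name here is just an example; any *.conf under /etc/sysctl.d/ is picked up):

  sysctl -p /etc/sysctl.d/90-vm-dirty.conf   # load just this file
  sysctl --system                            # or reload all sysctl configuration files

Note that the bytes-based parameters override the ratio-based ones: when vm.dirty_bytes or vm.dirty_background_bytes is set, the corresponding *_ratio reads back as 0, so only one of each pair is ever in effect.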
In essence, the problem was that the system evidently kept hitting the critical vm.dirty_ratio limit (dirty pages were accumulating faster than it managed to flush them to the disks); at that point practically all IO stalls until the share of dirty pages drops back below vm.dirty_ratio. The Windows VMs did not like that.
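To check whether that is really what is happening on your host, you can watch the dirty-page counters while a backup runs (a simple sketch using standard tools):

  watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'

If Dirty keeps climbing toward the vm.dirty_ratio threshold faster than Writeback drains it, the writers are about to get throttled.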
So try lowering them (depending on how much memory the server has): vm.dirty_ratio = 5, vm.dirty_background_ratio = 1.
vm.dirty_background_ratio should be smaller than vm.dirty_ratio by at least a factor of three or so. Both are percentages of memory.
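To translate those percentages into absolute numbers (a rough sketch: the kernel actually takes the ratios against available memory, so MemTotal is only a convenient approximation):

  total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  echo "background flush starts at ~$((total_kb * 1 / 100 / 1024)) MiB dirty (1%)"
  echo "writers start blocking at ~$((total_kb * 5 / 100 / 1024)) MiB dirty (5%)"

On the 96G host from the comments above that is roughly 1G and 4.8G respectively.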
The idea is for vm.dirty_background_ratio to kick in earlier and push dirty pages out to disk in the background, so that vm.dirty_ratio is never reached. And even if it were reached, the "stall" would be much shorter (there is much less left to flush) and the VMs would not crash. In my case, this helped.