Всем привет!
Ночью во время бэкапа начинает происходить не очень понятное мне событие - память на сервере перестает выделяться приложениям, и они, ожидая ее, завершаются с грязным выводом в syslog.
Хотел помониторить потребление RAM, сделал перенаправление /bin/ps aux --sort -rss | head -n 15
в файл ежеминутно - а по итогу файл заполнялся только тогда, когда систему отпускало, результата в итоге не получил.
В логах примерно такое, причем на множество софта, запущенного в этот момент -
gzip: page allocation stalls for 10412ms, order:0, mode:0x34200ca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE)
Oct 8 23:01:45 hostname kernel: [6532932.043362] CPU: 5 PID: 10216 t 4.9.0-3-amd64 #1 Debian 4.9.30-2+deb9u5
Oct 8 23:01:45 hostname kernel: [6532932.043363] Hardware name: Intel Corporation S5500BC/S5500BC, BIOS S5500.86B.01.00.0061.030920121535 03/09/2012
Oct 8 23:01:45 hostname kernel: [6532932.043365] 0000000000000000 ffffffff813285b4 ffffffff819febb0 ffffb591c3adfa88
Oct 8 23:01:45 hostname kernel: [6532932.043369] ffffffff81184f3a 034200ca00000006 ffffffff819febb0 ffffb591c3adfa28
Oct 8 23:01:45 hostname kernel: [6532932.043371] ffff8d1400000010 ffffb591c3adfa98 ffffb591c3adfa48 e9e70cf673c824f5
Oct 8 23:01:45 hostname kernel: [6532932.043374] Call Trace:
Oct 8 23:01:45 hostname kernel: [6532932.043383] [<ffffffff813285b4>] ? dump_stack+0x5c/0x78
Вот, например, упавший агент icinga2:
Oct 8 22:59:49 hostname kernel: [6532815.626800] icinga2: page allocation stalls for 37044ms, order:0, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
Oct 8 22:59:49 hostname kernel: [6532815.626809] CPU: 12 PID: 22483 Comm: icinga2 Not tainted 4.9.0-3-amd64 #1 Debian 4.9.30-2+deb9u5
Oct 8 22:59:49 hostname kernel: [6532815.626810] Hardware name: Intel Corporation S5500BC/S5500BC, BIOS S5500.86B.01.00.0061.030920121535 03/09/2012
Oct 8 22:59:49 hostname kernel: [6532815.626812] 0000000000000000 ffffffff813285b4 ffffffff819febb0 ffffb591c50b3970
Oct 8 22:59:49 hostname kernel: [6532815.626815] ffffffff81184f3a 026040c00d6284b6 ffffffff819febb0 ffffb591c50b3910
Oct 8 22:59:49 hostname kernel: [6532815.626818] 0000000000000010 ffffb591c50b3980 ffffb591c50b3930 7a32cddd0d6284b6
Oct 8 22:59:49 hostname kernel: [6532815.626821] Call Trace:
Oct 8 22:59:49 hostname kernel: [6532815.626827] [<ffffffff813285b4>] ? dump_stack+0x5c/0x78
Oct 8 22:59:49 hostname kernel: [6532815.626831] [<ffffffff81184f3a>] ? warn_alloc+0x13a/0x160
Oct 8 22:59:49 hostname kernel: [6532815.626833] [<ffffffff8118592d>] ? __alloc_pages_slowpath+0x95d/0xbc0
Oct 8 22:59:49 hostname kernel: [6532815.626836] [<ffffffff81185d8e>] ? __alloc_pages_nodemask+0x1fe/0x260
Oct 8 22:59:49 hostname kernel: [6532815.626840] [<ffffffff811de16d>] ? cache_grow_begin+0x9d/0x550
Oct 8 22:59:49 hostname kernel: [6532815.626841] [<ffffffff811de16d>] ? cache_grow_begin+0x9d/0x550
Oct 8 22:59:49 hostname kernel: [6532815.626843] [<ffffffff811de8d1>] ? fallback_alloc+0x161/0x200
Oct 8 22:59:49 hostname kernel: [6532815.626877] [<ffffffffc0345935>] ? ext4_htree_store_dirent+0x35/0x100 [ext4]
Oct 8 22:59:49 hostname kernel: [6532815.626879] [<ffffffff811df761>] ? __kmalloc+0x1e1/0x560
Oct 8 22:59:49 hostname kernel: [6532815.626894] [<ffffffffc0373e20>] ? ext4_encrypted_get_link+0x260/0x260 [ext4]
Oct 8 22:59:49 hostname kernel: [6532815.626908] [<ffffffffc0374022>] ? ext4fs_dirhash+0xc2/0x2a0 [ext4]
Oct 8 22:59:49 hostname kernel: [6532815.626919] [<ffffffffc0345935>] ? ext4_htree_store_dirent+0x35/0x100 [ext4]
Oct 8 22:59:49 hostname kernel: [6532815.626932] [<ffffffffc0357ff8>] ? htree_dirblock_to_tree+0xf8/0x2b0 [ext4]
Oct 8 22:59:49 hostname kernel: [6532815.626935] [<ffffffff8120bf19>] ? terminate_walk+0x89/0xf0
Oct 8 22:59:49 hostname kernel: [6532815.626948] [<ffffffffc03591a6>] ? ext4_htree_fill_tree+0xa6/0x2e0 [ext4]
Oct 8 22:59:49 hostname kernel: [6532815.626950] [<ffffffff811dfbfb>] ? kmem_cache_alloc_trace+0x11b/0x530
Oct 8 22:59:49 hostname kernel: [6532815.626961] [<ffffffffc03455b2>] ? ext4_readdir+0x6f2/0xa20 [ext4]
Oct 8 22:59:49 hostname kernel: [6532815.626963] [<ffffffff8121675b>] ? iterate_dir+0x16b/0x190
Oct 8 22:59:49 hostname kernel: [6532815.626965] [<ffffffff81216c48>] ? SyS_getdents+0x98/0x120
Oct 8 22:59:49 hostname kernel: [6532815.626966] [<ffffffff812169b0>] ? fillonedir+0xe0/0xe0
Oct 8 22:59:49 hostname kernel: [6532815.626969] [<ffffffff8160637b>] ? system_call_fast_compare_end+0xc/0x9b
Oct 8 22:59:49 hostname kernel: [6532815.626970] Mem-Info:
Oct 8 22:59:49 hostname kernel: [6532815.626977] active_anon:3517597 inactive_anon:440299 isolated_anon:0
Oct 8 22:59:49 hostname kernel: [6532815.626977] active_file:3429627 inactive_file:642068 isolated_file:1441
Oct 8 22:59:49 hostname kernel: [6532815.626977] unevictable:2868 dirty:636975 writeback:6355 unstable:0
Oct 8 22:59:49 hostname kernel: [6532815.626977] slab_reclaimable:57099 slab_unreclaimable:18213
Oct 8 22:59:49 hostname kernel: [6532815.626977] mapped:2832938 shmem:3872203 pagetables:26126 bounce:0
Oct 8 22:59:49 hostname kernel: [6532815.626977] free:40205 free_pcp:0 free_cma:0
Oct 8 22:59:49 hostname kernel: [6532815.626984] Node 0 active_anon:7068896kB inactive_anon:885492kB active_file:6358596kB inactive_file:1598932kB unevictable:5164kB isolated(anon):0kB$
Oct 8 22:59:49 hostname kernel: [6532815.626990] Node 1 active_anon:7001492kB inactive_anon:875704kB active_file:7359912kB inactive_file:969340kB unevictable:6308kB isolated(anon):0kB $
Oct 8 22:59:49 hostname kernel: [6532815.626991] Node 0 DMA free:15892kB min:40kB low:52kB high:64kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB$
Oct 8 22:59:49 hostname kernel: [6532815.626996] lowmem_reserve[]: 0 2113 15921 15921 15921
Oct 8 22:59:49 hostname kernel: [6532815.626999] Node 0 DMA32 free:61108kB min:5940kB low:8104kB high:10268kB active_anon:938576kB inactive_anon:219852kB active_file:713228kB inactive_$
Oct 8 22:59:49 hostname kernel: [6532815.627004] lowmem_reserve[]: 0 0 13807 13807 13807
Oct 8 22:59:49 hostname kernel: [6532815.627007] Node 0 Normal free:38584kB min:38804kB low:52940kB high:67076kB active_anon:6130320kB inactive_anon:665640kB active_file:5645368kB inac$
Oct 8 22:59:49 hostname kernel: [6532815.627012] lowmem_reserve[]: 0 0 0 0 0
Oct 8 22:59:49 hostname kernel: [6532815.627016] Node 1 Normal free:45236kB min:45320kB low:61832kB high:78344kB active_anon:7001492kB inactive_anon:875704kB active_file:7359912kB inac$
Oct 8 22:59:49 hostname kernel: [6532815.627021] lowmem_reserve[]: 0 0 0 0 0
Oct 8 22:59:49 hostname kernel: [6532815.627023] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15892$
Oct 8 22:59:49 hostname kernel: [6532815.627035] Node 0 DMA32: 1192*4kB (UME) 1205*8kB (UME) 625*16kB (UME) 247*32kB (UME) 202*64kB (UME) 68*128kB (UME) 19*256kB (UME) 5*512kB (UE) 0*1$
Oct 8 22:59:49 hostname kernel: [6532815.627045] Node 0 Normal: 8066*4kB (UME) 237*8kB (UME) 456*16kB (ME) 2*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 41520kB
Oct 8 22:59:49 hostname kernel: [6532815.627055] Node 1 Normal: 5226*4kB (UME) 2356*8kB (UME) 467*16kB (UME) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 47224kB
Oct 8 22:59:49 hostname kernel: [6532815.627066] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 8 22:59:49 hostname kernel: [6532815.627067] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 8 22:59:49 hostname kernel: [6532815.627068] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 8 22:59:49 hostname kernel: [6532815.627070] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 8 22:59:49 hostname kernel: [6532815.627070] 7952000 total pagecache pages
Oct 8 22:59:49 hostname kernel: [6532815.627071] 5048 pages in swap cache
Oct 8 22:59:49 hostname kernel: [6532815.627073] Swap cache stats: add 67106362, delete 67101314, find 213951398/230720868
Oct 8 22:59:49 hostname kernel: [6532815.627073] Free swap = 20329928kB
Oct 8 22:59:49 hostname kernel: [6532815.627074] Total swap = 31249404kB
Oct 8 22:59:49 hostname kernel: [6532815.627075] 8364727 pages RAM
Oct 8 22:59:49 hostname kernel: [6532815.627075] 0 pages HighMem/MovableOnly
Oct 8 22:59:49 hostname kernel: [6532815.627076] 150907 pages reserved
Oct 8 22:59:49 hostname kernel: [6532815.627076] 0 pages hwpoisoned
Чем можно подебажить такое поведение?
Потом систему отпускает, длится это все безобразие около 2х часов, и после все продолжает работать как ни в чем не бывало.
тут вывод free -h
free -h
total used free shared buff/cache available
Mem: 31G 3,2G 1,1G 14G 27G 12G
Swap: 29G 6,9G 22G