Hello! I have a problem: the network interfaces facing the LAN go down periodically (1-3 times a day). The machine serves as a file server and gateway for a small local network and runs Ubuntu desktop 11.10 with samba, shorewall, dnsmasq, ifenslave, etc.:
Linux bgps 3.0.0-12-generic #20-Ubuntu SMP Fri Oct 7 14:56:25 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
It has 3 network cards:
04:00.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet Controller (rev 05) (eth1)
04:01.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet Controller (rev 05) (eth2)
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03) (eth0, onboard)
eth0 faces outward; eth1 and eth2 are bonded (bond0) and face the LAN. Periodically, and with enviable regularity, one card goes down first, then after a while the second one, and the network is left without the file server and internet access. Here is what shows up in syslog:
…
Oct 24 19:14:07 bgps kernel: [35290.895387] irq 16: nobody cared (try booting with the "irqpoll" option)
Oct 24 19:14:07 bgps kernel: [35290.895394] Pid: 0, comm: swapper Tainted: G C 3.0.0-12-generic #20-Ubuntu
Oct 24 19:14:07 bgps kernel: [35290.895397] Call Trace:
Oct 24 19:14:07 bgps kernel: [35290.895399] <IRQ> [<ffffffff810cf8ad>] __report_bad_irq+0x3d/0xe0
Oct 24 19:14:07 bgps kernel: [35290.895410] [<ffffffff810cfcd5>] note_interrupt+0x135/0x180
Oct 24 19:14:07 bgps kernel: [35290.895415] [<ffffffff810cdcc9>] handle_irq_event_percpu+0xa9/0x220
Oct 24 19:14:07 bgps kernel: [35290.895419] [<ffffffff810cde8e>] handle_irq_event+0x4e/0x80
Oct 24 19:14:07 bgps kernel: [35290.895423] [<ffffffff810d0604>] handle_fasteoi_irq+0x64/0xf0
Oct 24 19:14:07 bgps kernel: [35290.895427] [<ffffffff8100c252>] handle_irq+0x22/0x40
Oct 24 19:14:07 bgps kernel: [35290.895431] [<ffffffff815f3d2a>] do_IRQ+0x5a/0xe0
Oct 24 19:14:07 bgps kernel: [35290.895435] [<ffffffff815ea413>] common_interrupt+0x13/0x13
Oct 24 19:14:07 bgps kernel: [35290.895437] <EOI> [<ffffffff813496cb>] ? intel_idle+0xcb/0x120
Oct 24 19:14:07 bgps kernel: [35290.895446] [<ffffffff813496ad>] ? intel_idle+0xad/0x120
Oct 24 19:14:07 bgps kernel: [35290.895451] [<ffffffff814ab062>] cpuidle_idle_call+0xa2/0x1d0
Oct 24 19:14:07 bgps kernel: [35290.895456] [<ffffffff8100920b>] cpu_idle+0xab/0x100
Oct 24 19:14:07 bgps kernel: [35290.895462] [<ffffffff815b803e>] rest_init+0x72/0x74
Oct 24 19:14:07 bgps kernel: [35290.895466] [<ffffffff81ad0c2b>] start_kernel+0x3d4/0x3df
Oct 24 19:14:07 bgps kernel: [35290.895471] [<ffffffff81ad0388>] x86_64_start_reservations+0x132/0x136
Oct 24 19:14:07 bgps kernel: [35290.895476] [<ffffffff81ad0140>] ? early_idt_handlers+0x140/0x140
Oct 24 19:14:07 bgps kernel: [35290.895481] [<ffffffff81ad0459>] x86_64_start_kernel+0xcd/0xdc
Oct 24 19:14:07 bgps kernel: [35290.895483] handlers:
Oct 24 19:14:07 bgps kernel: [35290.895506] [<ffffffffa00846c0>] e1000_intr
Oct 24 19:14:07 bgps kernel: [35290.895509] Disabling IRQ #16
…
Oct 24 19:15:18 bgps kernel: [35362.161750] ------------[ cut here ]------------
Oct 24 19:15:18 bgps kernel: [35362.161760] WARNING: at /build/buildd/linux-3.0.0/net/sched/sch_generic.c:255 dev_watchdog+0x25a/0x270()
Oct 24 19:15:18 bgps kernel: [35362.161763] Hardware name: System Product Name
Oct 24 19:15:18 bgps kernel: [35362.161765] NETDEV WATCHDOG: eth1 (e1000): transmit queue 0 timed out
Oct 24 19:15:18 bgps kernel: [35362.161767] Modules linked in: bnep rfcomm bluetooth act_police cls_flow cls_fw cls_u32 sch_tbf sch_prio sch_htb sch_hfsc sch_ingress sch_sfq xt_time xt_connlimit xt_realm xt_addrtype iptable_raw xt_comment xt_recent xt_policy ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah xt_set ip_set nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda ts_kmp nf_conntrack_amanda nf_conntrack_sane nf_conntrack_tftp nf_conntrack_sip nf_conntrack_proto_sctp nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp xt_TPROXY nf_tproxy_core ip6_tables nf_defrag_ipv6 xt_tcpmss xt_pkttype xt_physdev xt_owner xt_NFQUEUE xt_NFLOG nfnetlink_log xt_multiport xt_mark xt_mac xt_limit xt_length xt_iprange xt_helper xt_hashlimit xt_DSCP xt_dscp xt_dccp xt_conntrack xt_connmark
Oct 24 19:15:18 bgps kernel: xt_CLASSIFY xt_AUDIT ipt_LOG xt_tcpudp xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_mangle nfnetlink iptable_filter ip_tables x_tables parport_pc ppdev bonding binfmt_misc snd_hda_codec_hdmi snd_hda_codec_realtek usbhid hid eeepc_wmi asus_wmi sparse_keymap snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq psmouse snd_timer snd_seq_device i915 serio_raw snd drm_kms_helper wmi drm i2c_algo_bit video soundcore snd_page_alloc mei(C) lp parport raid456 async_pq async_xor raid10 xor async_memcpy async_raid6_recov e1000 ahci libahci xhci_hcd r8169 raid6_pq async_tx raid1 raid0 multipath linear
Oct 24 19:15:18 bgps kernel: [35362.161887] Pid: 0, comm: kworker/0:0 Tainted: G C 3.0.0-12-generic #20-Ubuntu
Oct 24 19:15:18 bgps kernel: [35362.161890] Call Trace:
Oct 24 19:15:18 bgps kernel: [35362.161892] <IRQ> [<ffffffff8105e83f>] warn_slowpath_common+0x7f/0xc0
Oct 24 19:15:18 bgps kernel: [35362.161902] [<ffffffff8105e936>] warn_slowpath_fmt+0x46/0x50
Oct 24 19:15:18 bgps kernel: [35362.161910] [<ffffffff814f6afa>] dev_watchdog+0x25a/0x270
Oct 24 19:15:18 bgps kernel: [35362.161915] [<ffffffff81011593>] ? native_sched_clock+0x13/0x60
Oct 24 19:15:18 bgps kernel: [35362.161920] [<ffffffff8107a270>] ? __queue_work+0x320/0x320
Oct 24 19:15:18 bgps kernel: [35362.161924] [<ffffffff814f68a0>] ? qdisc_reset+0x50/0x50
Oct 24 19:15:18 bgps kernel: [35362.161927] [<ffffffff814f68a0>] ? qdisc_reset+0x50/0x50
Oct 24 19:15:18 bgps kernel: [35362.161932] [<ffffffff8106d596>] call_timer_fn+0x46/0x160
Oct 24 19:15:18 bgps kernel: [35362.161935] [<ffffffff814f68a0>] ? qdisc_reset+0x50/0x50
Oct 24 19:15:18 bgps kernel: [35362.161938] [<ffffffff8106eec2>] run_timer_softirq+0x132/0x2a0
Oct 24 19:15:18 bgps kernel: [35362.161943] [<ffffffff81026d5d>] ? lapic_next_event+0x1d/0x30
Oct 24 19:15:18 bgps kernel: [35362.161947] [<ffffffff81065f58>] __do_softirq+0xa8/0x210
Oct 24 19:15:18 bgps kernel: [35362.161952] [<ffffffff8109388f>] ? tick_program_event+0x1f/0x30
Oct 24 19:15:18 bgps kernel: [35362.161956] [<ffffffff815f34dc>] call_softirq+0x1c/0x30
Oct 24 19:15:18 bgps kernel: [35362.161960] [<ffffffff8100c2d5>] do_softirq+0x65/0xa0
Oct 24 19:15:18 bgps kernel: [35362.161963] [<ffffffff8106633e>] irq_exit+0x8e/0xb0
Oct 24 19:15:18 bgps kernel: [35362.161967] [<ffffffff815f3e1e>] smp_apic_timer_interrupt+0x6e/0x99
Oct 24 19:15:18 bgps kernel: [35362.161971] [<ffffffff815f2c93>] apic_timer_interrupt+0x13/0x20
Oct 24 19:15:18 bgps kernel: [35362.161973] <EOI> [<ffffffff813496cb>] ? intel_idle+0xcb/0x120
Oct 24 19:15:18 bgps kernel: [35362.161982] [<ffffffff813496ad>] ? intel_idle+0xad/0x120
Oct 24 19:15:18 bgps kernel: [35362.161987] [<ffffffff814ab062>] cpuidle_idle_call+0xa2/0x1d0
Oct 24 19:15:18 bgps kernel: [35362.161992] [<ffffffff8100920b>] cpu_idle+0xab/0x100
Oct 24 19:15:18 bgps kernel: [35362.161996] [<ffffffff815cc2a6>] start_secondary+0xd9/0xdb
Oct 24 19:15:18 bgps kernel: [35362.161999] ---[ end trace 8b1848cfe680f7c8 ]---
Oct 24 19:15:18 bgps kernel: [35362.273731] bonding: bond0: link status down for interface eth1, disabling it in 200 ms.
Oct 24 19:15:18 bgps kernel: [35362.473689] bonding: bond0: link status definitely down for interface eth1, disabling it
Oct 24 19:15:18 bgps kernel: [35362.473696] bonding: bond0: making interface eth2 the new active one.
…
Oct 24 23:15:45 bgps kernel: [49786.062390] irq 17: nobody cared (try booting with the "irqpoll" option)
Oct 24 23:15:45 bgps kernel: [49786.062397] Pid: 0, comm: swapper Tainted: G WC 3.0.0-12-generic #20-Ubuntu
Oct 24 23:15:45 bgps kernel: [49786.062400] Call Trace:
Oct 24 23:15:45 bgps kernel: [49786.062402] <IRQ> [<ffffffff810cf8ad>] __report_bad_irq+0x3d/0xe0
Oct 24 23:15:45 bgps kernel: [49786.062414] [<ffffffff810cfcd5>] note_interrupt+0x135/0x180
Oct 24 23:15:45 bgps kernel: [49786.062418] [<ffffffff810cdcc9>] handle_irq_event_percpu+0xa9/0x220
Oct 24 23:15:45 bgps kernel: [49786.062422] [<ffffffff810cde8e>] handle_irq_event+0x4e/0x80
Oct 24 23:15:45 bgps kernel: [49786.062426] [<ffffffff810d0604>] handle_fasteoi_irq+0x64/0xf0
Oct 24 23:15:45 bgps kernel: [49786.062430] [<ffffffff8100c252>] handle_irq+0x22/0x40
Oct 24 23:15:45 bgps kernel: [49786.062434] [<ffffffff815f3d2a>] do_IRQ+0x5a/0xe0
Oct 24 23:15:45 bgps kernel: [49786.062438] [<ffffffff815ea413>] common_interrupt+0x13/0x13
Oct 24 23:15:45 bgps kernel: [49786.062440] <EOI> [<ffffffff814aad81>] ? poll_idle+0x41/0x80
Oct 24 23:15:45 bgps kernel: [49786.062448] [<ffffffff814aad53>] ? poll_idle+0x13/0x80
Oct 24 23:15:45 bgps kernel: [49786.062451] [<ffffffff814ab062>] cpuidle_idle_call+0xa2/0x1d0
Oct 24 23:15:45 bgps kernel: [49786.062457] [<ffffffff8100920b>] cpu_idle+0xab/0x100
Oct 24 23:15:45 bgps kernel: [49786.062462] [<ffffffff815b803e>] rest_init+0x72/0x74
Oct 24 23:15:45 bgps kernel: [49786.062466] [<ffffffff81ad0c2b>] start_kernel+0x3d4/0x3df
Oct 24 23:15:45 bgps kernel: [49786.062471] [<ffffffff81ad0388>] x86_64_start_reservations+0x132/0x136
Oct 24 23:15:45 bgps kernel: [49786.062476] [<ffffffff81ad0140>] ? early_idt_handlers+0x140/0x140
Oct 24 23:15:45 bgps kernel: [49786.062481] [<ffffffff81ad0459>] x86_64_start_kernel+0xcd/0xdc
Oct 24 23:15:45 bgps kernel: [49786.062483] handlers:
Oct 24 23:15:45 bgps kernel: [49786.062506] [<ffffffffa00846c0>] e1000_intr
Oct 24 23:15:45 bgps kernel: [49786.062509] Disabling IRQ #17
Oct 24 23:17:01 bgps CRON[5856]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Oct 24 23:17:13 bgps kernel: [49874.070647] bonding: bond0: link status down for interface eth2, disabling it in 200 ms.
Oct 24 23:17:13 bgps kernel: [49874.270611] bonding: bond0: link status definitely down for interface eth2, disabling it
Oct 24 23:17:13 bgps kernel: [49874.270964] bonding: bond0: now running without any active interface !
…
~$ cat /etc/network/interfaces
auto lo
iface lo inet loopback
auto eth0
iface eth0 inet static
    address ***.***.***.***
    netmask ***.***.***.***
    gateway ***.***.***.***
auto bond0
iface bond0 inet static
    address 10.0.0.1
    netmask 255.0.0.0
    bond_mode balance-tlb
    bond_miimon 100
    bond_downdelay 200
    bond_updelay 200
    slaves eth1 eth2
What should I do? Monitoring the interfaces with a script and restarting networking when they go down will not solve the problem.
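To make the timing in the log easier to see, here is a minimal Python sketch that pairs each "Disabling IRQ" message with the next "link status definitely down" message from bonding and reports the delay between them. The function name `correlate` and the pairing rule (most recent disabled IRQ wins) are my own assumptions for illustration. It is only a diagnostic aid and does not prove which IRQ belongs to which interface; that would have to be checked separately, for example in /proc/interrupts.

```python
import re

# Message formats copied from the syslog excerpt in this post.
IRQ_DISABLED = re.compile(r"\[\s*(\d+\.\d+)\] Disabling IRQ #(\d+)")
LINK_DOWN = re.compile(
    r"\[\s*(\d+\.\d+)\] bonding: \S+: "
    r"link status definitely down for interface (\S+),"
)


def correlate(lines):
    """Yield (irq, interface, delay_seconds) tuples.

    Heuristic (an assumption, not a proven mapping): each link-down
    event is paired with the most recently disabled IRQ before it.
    """
    last_irq = None  # (irq number, kernel timestamp) of the latest disable
    for line in lines:
        m = IRQ_DISABLED.search(line)
        if m:
            last_irq = (int(m.group(2)), float(m.group(1)))
            continue
        m = LINK_DOWN.search(line)
        if m and last_irq is not None:
            irq, disabled_at = last_irq
            yield irq, m.group(2), float(m.group(1)) - disabled_at
            last_irq = None  # pair each disabled IRQ at most once
```

It accepts any iterable of log lines, so an open handle on /var/log/syslog can be passed directly, e.g. `list(correlate(open("/var/log/syslog")))`.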