Once again I need advice from the mighty LOR community. There is a poor man's enterprise-segment cluster of two servers (+ a qdevice) running corosync 3.0.1, Pacemaker 2.0.1 and drbd 8.4.10. Four drbd resources have been created and declared in the cluster in primary/secondary mode, with filesystems on them and mount points. And overall, at first glance, everything seems to work.
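For one of the four volumes the configuration is roughly the following (crm shell syntax; the drbd resource name matches the logs below, while the device path, mount point and fstype are just placeholders):

# drbd device, managed as a master/slave (primary/secondary) set
primitive drbd_docker1 ocf:linbit:drbd \
    params drbd_resource=docker1 \
    op monitor interval=29s role=Master \
    op monitor interval=31s role=Slave
ms ms_drbd_docker1 drbd_docker1 \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
# filesystem living on top of it
primitive fs_docker1 ocf:heartbeat:Filesystem \
    params device=/dev/drbd1 directory=/mnt/docker1 fstype=ext4
colocation fs_docker1_on_master inf: fs_docker1 ms_drbd_docker1:Master
order drbd_docker1_before_fs inf: ms_drbd_docker1:promote fs_docker1:start

The other three (docker2, lxc1, lxc2) are declared the same way.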
How do we test it? We make sure all these resources are sitting on node1 and then simply type reboot, with or without any flags (--force, --halt and so on); we can even do a warm or cold shutdown, or a power cycle, over IPMI.
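"Make sure all the resources are on node1" means nothing more scientific than glancing at the usual status output before pulling the trigger, something like this (drbd 8.4, hence /proc/drbd):

crm_mon -1        # all Master/Started roles are expected on node1
cat /proc/drbd    # every minor shows Primary/Secondary and UpToDate/UpToDate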
What happens in that case? Roughly speaking, the cluster on node2 sees that node1 has dropped out, pulls the drbd resources over to itself, tells them they are now primary and mounts the filesystems. Once node1 comes back up, the cluster on it syncs the drbd resources back as secondary and everything keeps running i-de-al-ly, and I would gladly put a full stop right here so you could simply be happy for me, but no, alas. Anyway, no matter how I tried to break the cluster, it always survives, with one single exception: ip l s bond0 down
on one of the nodes. What happens then? The node loses quorum; the surviving node sees that, sees that it still has quorum, promotes all the drbd resources on itself, and stonith sends a fence reboot to the node left without network. That node duly reboots and… for some reason brings all its drbd devices up as primary, which results in a split brain. The question here is really for some strong HA admins to explain to me: why does it do that? How is this state any different from that same reboot --force or power cycle? Can this be avoided somehow, or is it how pacemaker is architecturally meant to work?
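A related question, in case it matters for the answer: is this exact scenario what the DRBD-side fencing hooks are supposed to cover? I mean something along these lines in the drbd resource config (the snippet is roughly what the drbd 8.4 documentation suggests; the handler paths are whatever drbd-utils ships and may differ):

resource docker1 {
    disk {
        fencing resource-and-stonith;
    }
    handlers {
        fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    # the rest of the resource (net, volumes, hosts) stays as it is
}

As far as I understand, crm-fence-peer.sh would put a constraint into the CIB that forbids promoting the peer until it has resynced, which sounds like exactly the missing piece here, but I would like to hear it from people who actually run this.

Log from node1 right after it comes back from the fence reboot: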
Aug 7 14:32:26 node1 pacemakerd[1239]: notice: Quorum acquired
Aug 7 14:32:26 node1 pacemaker-controld[1257]: notice: Node node2 state is now member
Aug 7 14:32:26 node1 pacemakerd[1239]: notice: Node node2 state is now member
Aug 7 14:32:26 node1 pacemaker-controld[1257]: notice: State transition S_IDLE -> S_INTEGRATION
Aug 7 14:32:26 node1 pacemaker-controld[1257]: warning: Another DC detected: node2 (op=noop)
Aug 7 14:32:26 node1 pacemaker-controld[1257]: notice: State transition S_ELECTION -> S_RELEASE_DC
Aug 7 14:32:26 node1 pacemaker-controld[1257]: notice: State transition S_PENDING -> S_NOT_DC
Aug 7 14:32:26 node1 pacemaker-attrd[1255]: notice: Detected another attribute writer (node2), starting new election
Aug 7 14:32:26 node1 kernel: [ 90.738799] block drbd2: drbd_bm_resize called with capacity == 157281528
Aug 7 14:32:26 node1 kernel: [ 90.739850] block drbd2: size = 75 GB (78640764 KB)
Aug 7 14:32:26 node1 kernel: [ 90.777678] block drbd2: 4 KB (1 bits) marked out-of-sync by on disk bit-map.
Aug 7 14:32:26 node1 kernel: [ 90.777685] block drbd2: attached to UUIDs 52BD2CD9F20589D9:5665978A6ED4A9BF:45609D915FEB64F7:455F9D915FEB64F7
Aug 7 14:32:26 node1 pacemaker-controld[1257]: notice: Result of start operation for node2.stonith on node1: 0 (ok)
Aug 7 14:32:26 node1 kernel: [ 90.867945] block drbd4: recounting of set bits took additional 0 jiffies
Aug 7 14:32:26 node1 kernel: [ 90.870319] block drbd4: disk( Attaching -> UpToDate )
Aug 7 14:32:26 node1 kernel: [ 90.889409] drbd docker1: Starting worker thread (from drbdsetup-84 [2310])
Aug 7 14:32:26 node1 kernel: [ 90.893568] block drbd3: disk( Diskless -> Attaching )
Aug 7 14:32:26 node1 kernel: [ 90.893627] drbd docker1: Method to ensure write ordering: flush
Aug 7 14:32:26 node1 kernel: [ 90.893631] block drbd3: drbd_bm_resize called with capacity == 157281528
Aug 7 14:32:26 node1 kernel: [ 90.893788] block drbd3: size = 75 GB (78640764 KB)
Aug 7 14:32:26 node1 kernel: [ 90.932991] block drbd3: recounting of set bits took additional 0 jiffies
Aug 7 14:32:26 node1 kernel: [ 90.932997] block drbd3: disk( Attaching -> UpToDate )
Aug 7 14:32:26 node1 kernel: [ 90.948787] drbd docker2: Starting receiver thread (from drbd_w_docker2 [2325])
Aug 7 14:32:26 node1 kernel: [ 90.948971] drbd docker2: conn( Unconnected -> WFConnection )
Aug 7 14:32:26 node1 kernel: [ 90.949273] drbd lxc2: Starting receiver thread (from drbd_w_lxc2 [2307])
Aug 7 14:32:26 node1 kernel: [ 90.949304] drbd lxc2: conn( Unconnected -> WFConnection )
Aug 7 14:32:26 node1 kernel: [ 90.949989] drbd lxc1: Starting receiver thread (from drbd_w_lxc1 [2314])
Aug 7 14:32:26 node1 kernel: [ 90.950041] drbd lxc1: conn( Unconnected -> WFConnection )
Aug 7 14:32:26 node1 pacemaker-execd[1254]: notice: drbd_docker1_start_0:2138:stderr [ Marked additional 24 MB as out-of-sync based on AL. ]
Aug 7 14:32:26 node1 pacemaker-controld[1257]: notice: Result of start operation for drbd_docker1 on node1: 0 (ok)
Aug 7 14:32:26 node1 pacemaker-controld[1257]: notice: Result of start operation for drbd_docker2 on node1: 0 (ok)
Aug 7 14:32:26 node1 pacemaker-controld[1257]: notice: Result of start operation for drbd_lxc2 on node1: 0 (ok)
Aug 7 14:32:26 node1 pacemaker-execd[1254]: notice: drbd_lxc1_start_0:2142:stderr [ Marked additional 92 MB as out-of-sync based on AL. ]
Aug 7 14:32:26 node1 pacemaker-controld[1257]: notice: Result of start operation for drbd_lxc1 on node1: 0 (ok)
Aug 7 14:32:26 node1 pacemaker-controld[1257]: notice: Result of notify operation for drbd_docker2 on node1: 0 (ok)
Aug 7 14:32:26 node1 pacemaker-controld[1257]: notice: Result of notify operation for drbd_lxc2 on node1: 0 (ok)
Aug 7 14:32:26 node1 pacemaker-controld[1257]: notice: Result of notify operation for drbd_docker1 on node1: 0 (ok)
Aug 7 14:32:26 node1 pacemaker-controld[1257]: notice: Result of notify operation for drbd_lxc1 on node1: 0 (ok)
Aug 7 14:32:27 node1 kernel: [ 91.478835] drbd lxc2: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
Aug 7 14:32:27 node1 kernel: [ 91.478857] drbd lxc1: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
Aug 7 14:32:27 node1 kernel: [ 91.481323] drbd lxc1: conn( WFConnection -> WFReportParams )
Aug 7 14:32:27 node1 kernel: [ 91.482389] drbd lxc2: Peer authenticated using 20 bytes HMAC
Aug 7 14:32:27 node1 kernel: [ 91.482450] drbd lxc2: conn( WFConnection -> WFReportParams )
Aug 7 14:32:27 node1 kernel: [ 91.482451] drbd lxc2: Starting ack_recv thread (from drbd_r_lxc2 [2362])
Aug 7 14:32:27 node1 kernel: [ 91.483381] drbd docker2: conn( WFConnection -> WFReportParams )
Aug 7 14:32:27 node1 kernel: [ 91.650900] block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2
Aug 7 14:32:27 node1 kernel: [ 91.652341] block drbd2: Split-Brain detected but unresolved, dropping connection!
Aug 7 14:32:27 node1 kernel: [ 91.653732] block drbd2: helper command: /sbin/drbdadm split-brain minor-2 exit code 0 (0x0)
Aug 7 14:32:27 node1 kernel: [ 91.653747] drbd lxc2: error receiving ReportState, e: -5 l: 0!
Aug 7 14:32:27 node1 kernel: [ 91.653756] drbd lxc2: Terminating drbd_a_lxc2
Aug 7 14:32:27 node1 kernel: [ 91.654441] block drbd4: self 367575DA2AC7BE0E:9F1A373A29FF8BD5:2787ACA9CCCEE11B:2786ACA9CCCEE11B bits:1 flags:0
Aug 7 14:32:27 node1 kernel: [ 91.654443] block drbd4: peer B9B0BAE43A1AF1DD:9F1A373A29FF8BD4:2787ACA9CCCEE11A:2786ACA9CCCEE11B bits:1 flags:0
Aug 7 14:32:27 node1 kernel: [ 91.654444] block drbd4: uuid_compare()=100 by rule 90
Aug 7 14:32:27 node1 kernel: [ 91.654447] block drbd4: helper command: /sbin/drbdadm initial-split-brain minor-4
Aug 7 14:32:27 node1 kernel: [ 91.655554] block drbd4: helper command: /sbin/drbdadm initial-split-brain minor-4 exit code 0 (0x0)
Aug 7 14:32:27 node1 kernel: [ 91.655564] block drbd4: helper command: /sbin/drbdadm split-brain minor-4
Aug 7 14:32:27 node1 kernel: [ 91.656622] drbd docker2: conn( WFReportParams -> Disconnecting )
Aug 7 14:32:27 node1 kernel: [ 91.656628] drbd docker2: ack_receiver terminated
Aug 7 14:32:27 node1 kernel: [ 91.666424] block drbd1: drbd_sync_handshake:
Aug 7 14:32:27 node1 kernel: [ 91.680968] block drbd1: peer 392EF5193CD03F05:4AE6A6A6E66E1C9C:8A1CC93CFB780678:8A1BC93CFB780679 bits:734 flags:0
Aug 7 14:32:27 node1 kernel: [ 91.680970] block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
Aug 7 14:32:27 node1 kernel: [ 91.682426] block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
Aug 7 14:32:27 node1 kernel: [ 91.682438] block drbd1: Split-Brain detected but unresolved, dropping connection!
Aug 7 14:32:27 node1 kernel: [ 91.682441] block drbd1: helper command: /sbin/drbdadm split-brain minor-1
Aug 7 14:32:27 node1 kernel: [ 91.684094] block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
Aug 7 14:32:27 node1 kernel: [ 91.684109] drbd lxc1: error receiving ReportState, e: -5 l: 0!
Aug 7 14:32:27 node1 kernel: [ 91.684116] drbd lxc1: Terminating drbd_a_lxc1
Aug 7 14:32:27 node1 kernel: [ 91.686478] block drbd3: self 81667BA24C1D4D86:EC36773B148CA0ED:4D7665F7AB17767D:4D7565F7AB17767D bits:6144 flags:0
Aug 7 14:32:27 node1 kernel: [ 91.686481] block drbd3: uuid_compare()=100 by rule 90
Aug 7 14:32:27 node1 kernel: [ 91.688160] block drbd3: helper command: /sbin/drbdadm initial-split-brain minor-3 exit code 0 (0x0)
Aug 7 14:32:27 node1 kernel: [ 91.688199] block drbd3: helper command: /sbin/drbdadm split-brain minor-3
Aug 7 14:32:27 node1 kernel: [ 91.689905] drbd docker1: conn( WFReportParams -> Disconnecting )
Aug 7 14:32:27 node1 kernel: [ 91.689915] drbd docker1: ack_receiver terminated
Aug 7 14:32:27 node1 kernel: [ 91.742453] drbd docker2: Connection closed
Aug 7 14:32:27 node1 kernel: [ 91.743048] drbd docker2: conn( Disconnecting -> StandAlone )
Aug 7 14:32:27 node1 kernel: [ 91.743948] drbd docker2: receiver terminated
Aug 7 14:32:27 node1 kernel: [ 91.744809] drbd docker2: Terminating drbd_r_docker2
Aug 7 14:32:27 node1 kernel: [ 91.830476] drbd lxc1: receiver terminated
Aug 7 14:32:27 node1 kernel: [ 91.830479] drbd docker1: conn( Disconnecting -> StandAlone )
Aug 7 14:32:27 node1 kernel: [ 91.830480] drbd docker1: Terminating drbd_r_docker1