Piviul On 07/11/22 08:23, Piviul wrote:
Hi Marco, first of all, thank you!

On 05/11/22 22:00, Marco Gaiarin wrote:
> [...]
> If you don't have quorum (say you have three nodes, a quorum of 2 (the
> default) and one node down), it is normal that the machines don't start.

Well, maybe I wasn't clear... all 3 nodes are up and running. I can connect to each of the 3 nodes, including through the web interface, but from any one node the other 2 are not visible. It is the cluster that has problems.

The service that fails to start is the "Proxmox VE replication runner" (pvesr); besides, as far as I remember I have no replication configured, I only use Ceph and I got rid of ZFS. All the virtual hosts work fine and have no problems, except that backups no longer work and, since the cluster is not working, I can't even migrate them to another node; moreover, if I shut down a machine I can no longer start it again.

> 'pvesr' is the replication service for ZFS storage; if you don't have ZFS
> or don't use replication, that error is irrelevant (and in any case it has
> nothing to do with quorum).
>
> My guess is that corosync is stopped on one node, and maybe it is enough
> to restart that node (or restart corosync on it); try running:
>
>     pvecm status
>
> on all the nodes and see what it tells you.

Corosync has no problems; I restarted it on all 3 nodes and it gave no errors. This is the output of pvecm status:

# pvecm status
Cluster information
-------------------
Name:             CSA-cluster1
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Nov 7 08:11:51 2022
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.91e
Quorate:          No

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      1
Quorum:           2 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.255.1 (local)

The syslog is full of messages like:

Nov 7 08:14:33 pve01 pveproxy[2699797]: Cluster not quorate - extending auth key lifetime!
Nov 7 08:14:35 pve01 pveproxy[2699797]: Cluster not quorate - extending auth key lifetime!

Also, it all started on Friday evening, and in syslog I found:

Nov 4 23:37:01 pve02 systemd[1]: Started Proxmox VE replication runner.
Nov 4 23:37:01 pve02 CRON[2145590]: (root) CMD (if test -x /usr/sbin/apticron; then /usr/sbin/apticron --cron; else true; fi)
Nov 4 23:38:00 pve02 systemd[1]: Starting Proxmox VE replication runner...
Nov 4 23:38:01 pve02 systemd[1]: pvesr.service: Succeeded.
Nov 4 23:38:01 pve02 systemd[1]: Started Proxmox VE replication runner.
Nov 4 23:38:26 pve02 corosync[1703]: [KNET ] link: host: 3 link: 0 is down
Nov 4 23:38:26 pve02 corosync[1703]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 4 23:38:26 pve02 corosync[1703]: [KNET ] host: host: 3 has no active links
Nov 4 23:38:28 pve02 corosync[1703]: [TOTEM ] Token has not been received in 2737 ms
Nov 4 23:38:30 pve02 corosync[1703]: [KNET ] rx: host: 3 link: 0 is up
Nov 4 23:38:30 pve02 corosync[1703]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 4 23:38:32 pve02 corosync[1703]: [QUORUM] Sync members[2]: 1 2
Nov 4 23:38:32 pve02 corosync[1703]: [QUORUM] Sync left[1]: 3
Nov 4 23:38:32 pve02 corosync[1703]: [TOTEM ] A new membership (1.873) was formed. Members left: 3
Nov 4 23:38:32 pve02 corosync[1703]: [TOTEM ] Failed to receive the leave message. failed: 3
Nov 4 23:38:32 pve02 pmxcfs[1578]: [dcdb] notice: members: 1/1626, 2/1578
Nov 4 23:38:32 pve02 pmxcfs[1578]: [dcdb] notice: starting data syncronisation
Nov 4 23:38:32 pve02 pmxcfs[1578]: [status] notice: members: 1/1626, 2/1578
Nov 4 23:38:32 pve02 pmxcfs[1578]: [status] notice: starting data syncronisation
Nov 4 23:38:32 pve02 corosync[1703]: [QUORUM] Members[2]: 1 2
Nov 4 23:38:32 pve02 corosync[1703]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 4 23:38:32 pve02 pmxcfs[1578]: [dcdb] notice: received sync request (epoch 1/1626/00000009)
Nov 4 23:38:32 pve02 pmxcfs[1578]: [status] notice: received sync request (epoch 1/1626/00000009)
Nov 4 23:38:32 pve02 pmxcfs[1578]: [dcdb] notice: received all states
Nov 4 23:38:32 pve02 pmxcfs[1578]: [dcdb] notice: leader is 1/1626
Nov 4 23:38:32 pve02 pmxcfs[1578]: [dcdb] notice: synced members: 1/1626, 2/1578
Nov 4 23:38:32 pve02 pmxcfs[1578]: [dcdb] notice: all data is up to date
Nov 4 23:38:32 pve02 pmxcfs[1578]: [dcdb] notice: dfsm_deliver_queue: queue length 2
Nov 4 23:38:32 pve02 pmxcfs[1578]: [status] notice: received all states
Nov 4 23:38:32 pve02 pmxcfs[1578]: [status] notice: all data is up to date
Nov 4 23:38:32 pve02 pmxcfs[1578]: [status] notice: dfsm_deliver_queue: queue length 46
Nov 4 23:38:34 pve02 corosync[1703]: [KNET ] link: host: 3 link: 0 is down
Nov 4 23:38:34 pve02 corosync[1703]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 4 23:38:34 pve02 corosync[1703]: [KNET ] host: host: 3 has no active links
Nov 4 23:38:41 pve02 corosync[1703]: [KNET ] link: host: 1 link: 0 is down
Nov 4 23:38:41 pve02 corosync[1703]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 4 23:38:41 pve02 corosync[1703]: [KNET ] host: host: 1 has no active links
Nov 4 23:38:42 pve02 corosync[1703]: [TOTEM ] Token has not been received in 2737 ms
Nov 4 23:38:43 pve02 corosync[1703]: [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
Nov 4 23:38:48 pve02 corosync[1703]: [QUORUM] Sync members[1]: 2
Nov 4 23:38:48 pve02 corosync[1703]: [QUORUM] Sync left[1]: 1
Nov 4 23:38:48 pve02 corosync[1703]: [TOTEM ] A new membership (2.877) was formed. Members left: 1
Nov 4 23:38:48 pve02 corosync[1703]: [TOTEM ] Failed to receive the leave message. failed: 1
Nov 4 23:38:48 pve02 pmxcfs[1578]: [dcdb] notice: members: 2/1578
Nov 4 23:38:48 pve02 pmxcfs[1578]: [status] notice: members: 2/1578
Nov 4 23:38:48 pve02 corosync[1703]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 4 23:38:48 pve02 corosync[1703]: [QUORUM] Members[1]: 2
Nov 4 23:38:48 pve02 corosync[1703]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 4 23:38:48 pve02 pmxcfs[1578]: [status] notice: node lost quorum
Nov 4 23:38:48 pve02 pmxcfs[1578]: [dcdb] crit: received write while not quorate - trigger resync
Nov 4 23:38:48 pve02 pmxcfs[1578]: [dcdb] crit: leaving CPG group
Nov 4 23:38:48 pve02 pve-ha-lrm[1943]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve02/lrm_status.tmp.1943' - Permission denied
Nov 4 23:38:49 pve02 pmxcfs[1578]: [dcdb] notice: start cluster connection
Nov 4 23:38:49 pve02 pmxcfs[1578]: [dcdb] crit: cpg_join failed: 14
Nov 4 23:38:49 pve02 pmxcfs[1578]: [dcdb] crit: can't initialize service
Nov 4 23:38:55 pve02 pmxcfs[1578]: [dcdb] notice: members: 2/1578
Nov 4 23:38:55 pve02 pmxcfs[1578]: [dcdb] notice: all data is up to date
Nov 4 23:39:00 pve02 systemd[1]: Starting Proxmox VE replication runner...
Nov 4 23:39:01 pve02 pvesr[2146320]: trying to acquire cfs lock 'file-replication_cfg' ...
[...]

Searching online, I saw that I am not the only one with this problem, and everyone points to setting the quorum to one with pvecm expected 1, but I am a bit worried: before doing anything I would like to be absolutely sure about it, since I don't even have a backup of the virtual machines!

Thanks a lot
Piviul
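
For reference, the vote counts in the pvecm status output above can be checked mechanically. What follows is only a minimal Python sketch and not something from the thread: the function names (parse_votequorum, is_quorate, majority) and the SAMPLE string are illustrative, and the majority formula assumes corosync's default votequorum behaviour with no special options set. It reads the "Votequorum information" counters and compares the votes currently present against the quorum threshold.

```python
import re

# Excerpt of the "Votequorum information" section quoted in the message.
SAMPLE = """\
Expected votes:   3
Highest expected: 3
Total votes:      1
Quorum:           2 Activity blocked
"""


def parse_votequorum(status_text: str) -> dict:
    """Extract the integer vote counters from `pvecm status` output."""
    fields = {}
    for key in ("Expected votes", "Total votes", "Quorum"):
        # Anchoring on "<key>:" at line start avoids matching
        # "Quorum provider:" or "Highest expected:".
        match = re.search(rf"^{key}:\s+(\d+)", status_text, re.MULTILINE)
        if match:
            fields[key] = int(match.group(1))
    return fields


def majority(expected_votes: int) -> int:
    """Strict majority of the expected votes (assumed default rule)."""
    return expected_votes // 2 + 1


def is_quorate(fields: dict) -> bool:
    """A partition is quorate when it holds at least `Quorum` votes."""
    return fields["Total votes"] >= fields["Quorum"]


if __name__ == "__main__":
    counters = parse_votequorum(SAMPLE)
    print(counters)
    print("majority of expected votes:", majority(counters["Expected votes"]))
    print("quorate:", is_quorate(counters))
```

With the numbers from the output above (3 expected votes, 1 vote present, quorum 2), the node sees only its own vote and is therefore below the threshold, which matches the "Quorate: No" and "Activity blocked" lines.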