Beyond VMware

Because… Broadcom… a great many organizations are migrating away from VMware as fast as their little legs will carry them. Strong alternatives in the space include Nutanix, Proxmox and HPE VM Essentials.

All of these are based around KVM/QEMU, but each brings its own flavour of management and additional capabilities.

I’ve recently been playing with HPE VME in my home lab, using a variety of old desktop PCs as the compute nodes and TrueNAS for storage. There are a few gotchas with my setup which caught me out and wasted a bit of time, so I thought I’d document them here in case they’re useful to someone else.

Mixing CPU types
VM Essentials is qualified to run on specific HPE ProLiant hardware, so anyone deploying it seriously is going to have a pile of nice, new, matching hardware. I don’t have that. I have three different generations of Intel CPU and an AMD CPU across my available machines. A VME cluster defaults to a CPU type of host-passthrough, which is the most efficient and offers the best performance. However, if you want to play with live placement of VMs (vMotion in VMware terms) and your physical CPUs don’t match, you’ll have a problem: the migration may simply fail.
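Since VME is KVM/QEMU under the hood, you can sanity-check how different your nodes really are from the shell of each one. Nothing here is VME-specific, just lscpu and the standard libvirt client tools, assuming they’re present on the node:

    # The physical CPU as the OS sees it
    lscpu | grep 'Model name'

    # The CPU model libvirt has detected for this host; if these differ
    # between nodes, host-passthrough live migration is on shaky ground
    virsh capabilities | grep -m1 '<model>'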

When I attempted to manage placement of a VM from a newer to an older Intel host, or between Intel and AMD, it failed. Syslog showed that QEMU wasn’t happy, which makes complete sense: with host-passthrough we’re effectively asking to switch the CPU type of a running VM, and you can’t do that.
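If you hit this, the complaint should show up in the logs of the nodes involved. These are the standard Ubuntu/libvirt locations rather than anything VME-specific, and the VM name below is a placeholder:

    # System log on the source or destination node
    grep -iE 'qemu|libvirt' /var/log/syslog | grep -iE 'cpu|migrat'

    # libvirt also keeps a per-VM QEMU log, which usually has the clearer error
    less /var/log/libvirt/qemu/<vm-name>.log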

To get around this I changed the cluster to use the qemu64 CPU type, which abstracts the CPU presented to the VM from the physical CPU of the host. There is some performance penalty to this, and it may mean the host’s CPU capabilities can’t be fully utilized, but for a lab environment that’s fine. This could also become a problem in production if you were to add new nodes with sufficiently different CPUs to an existing setup. There you could either use the qemu64 CPU type, or group the new nodes and associate specific VM workloads with the relevant group, bearing in mind the HA requirements you have.
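VME exposes this as a cluster setting in its UI, but because it’s driving libvirt underneath, the difference ultimately shows up in the <cpu> element of each VM’s domain XML. As a rough sketch (the VM name is a placeholder, and the XML shapes shown are standard libvirt rather than literal VME output), you can see which definition a running VM actually got:

    # Run on the node currently hosting the VM
    virsh dumpxml <vm-name> | grep -A3 '<cpu'
    #
    # With the cluster set to host-passthrough you'd expect something like:
    #   <cpu mode='host-passthrough' .../>
    # and with qemu64 something like:
    #   <cpu mode='custom' match='exact'>
    #     <model fallback='allow'>qemu64</model>
    #   </cpu>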

Quorum
If you’re used to clustering where there’s no shared resource, such as a shared file system, whose corruption could totally destroy your data, you might not have encountered the idea of the cluster maintaining quorum. It’s the same concept used by voting committees the world over: if not every member is present, do you still have enough members to take a vote?

I won’t go into the details of how this all works; suffice it to say that a cluster needs a strict majority of its members to be quorate, so it can’t be quorate if it loses 50% or more of them. With two nodes a majority means both of them; with three, any two will do. If you’re building a lab with VME using GFS2 on shared storage and you only have two compute nodes, then whenever one is offline you’ll get filesystem locking behaviour you might not expect. Ensure you have an odd number of nodes (for a lab setup, 1 or 3), or use NFS for your shared storage, where the storage server handles the locks.

If you want to see whether your cluster is quorate, run corosync-quorumtool on the CLI of any node.
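Something like the following; the flags and field names are standard corosync, though the exact output varies by version:

    # -s shows the quorum summary, -l lists the member nodes
    corosync-quorumtool -s
    #
    # The interesting lines look roughly like:
    #   Quorate:          Yes      <- the cluster can safely hand out GFS2 locks
    #   Expected votes:   3
    #   Total votes:      2        <- one node is down in this example
    #   Quorum:           2        <- votes needed to remain quorate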

I started off with two nodes, would occasionally reboot one and then wonder why things stopped working… that was why. You can of course run your lab with only two nodes; you just have to make sure they’re both online whenever you’re running a workload on the GFS2 datastore.