I’m far from the first to share my thoughts on this, but I saw a live demo of an Aruba OS8 cluster being upgraded at the company’s Atmosphere event a couple of years ago. The controllers serving the conference were upgraded live, while we all checked our devices to confirm that we were still online.
The live cluster upgrade is probably one of the biggest headline features of AOS8. There are others I particularly like, but that’s for another time. The process works best if you have a reasonable amount of cell overlap in your design.
First, all the clients and APs are moved away from one controller in the cluster, and that controller is upgraded. Once it comes back online and syncs up it becomes a cluster master, in a cluster of one. Then a group of APs (AOS calls this a partition) is selected, the aim being that none of the selected APs are neighbours of each other. The new firmware is pre-loaded onto that partition of APs. Next, AOS uses all the ClientMatch tools it can to encourage clients to leave these APs before they reboot, so that each client can associate with a neighbouring AP instead.
Once the upgraded APs come back up they’ll join the upgraded controller, and so the process rumbles on. If you have multiple controllers, at some point AOS will upgrade the other controllers and expand the new cluster.
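To make the partition idea concrete, here is a minimal sketch in Python of one way to split APs into groups where no two members are neighbours. I don't know how AOS actually chooses its partitions; this is plain greedy graph colouring, and the function name, the AP names and the neighbour map are all made up for illustration.

```python
from collections import defaultdict


def partition_aps(neighbours):
    """Group APs so that no two APs in the same group are neighbours.

    `neighbours` maps an AP name to the APs it can hear (hypothetical input;
    in practice this would come from RF neighbour data). Greedy colouring,
    not necessarily what AOS does.
    """
    # Build a symmetric adjacency map so "A hears B" implies "B hears A".
    adj = defaultdict(set)
    for ap, nbrs in neighbours.items():
        adj[ap]  # make sure isolated APs are included
        for n in nbrs:
            adj[ap].add(n)
            adj[n].add(ap)

    # Place the most-connected APs first; each AP goes into the
    # lowest-numbered partition that holds none of its neighbours.
    assignment = {}
    for ap in sorted(adj, key=lambda a: (-len(adj[a]), a)):
        used = {assignment[n] for n in adj[ap] if n in assignment}
        p = 0
        while p in used:
            p += 1
        assignment[ap] = p

    partitions = defaultdict(list)
    for ap, p in assignment.items():
        partitions[p].append(ap)
    return [sorted(partitions[p]) for p in sorted(partitions)]


# Four APs along a corridor: each one only hears the AP next to it.
corridor = {"ap1": ["ap2"], "ap2": ["ap3"], "ap3": ["ap4"]}
print(partition_aps(corridor))
```

With the corridor example every AP lands in one of two partitions, so whenever one partition reboots each of its APs still has an upgraded or not-yet-upgraded neighbour on air. The denser the neighbour graph, the more partitions you need, which is part of why a big upgrade takes many rounds.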
For always-on networks like a university campus, hospital or airport, this is a great step forward as it allows much more regular code upgrades. A good thing.
However, and there always has to be a downside, doesn’t there, it doesn’t always go quite as expected.
I performed a cluster upgrade from 8.3 to 8.4 and it took a long time. In fact it took about 17 hours to upgrade 2,500 APs and four controllers. APs can get into a state where the software pre-load doesn’t work. Rebooting the AP will fix this, but AOS doesn’t do that. Instead it retries the pre-load five times at 10-minute intervals, so a partition containing one such AP takes almost an hour. If you have one AP that doesn’t behave in each partition, the entire process drags out for a really long time. Aruba have acknowledged this is an issue and I expect eventually there’ll be a fix or workaround.
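To put rough numbers on that retry behaviour, here is a back-of-envelope sketch in Python. The five retries and the 10-minute interval are what I saw; the number of partitions, the time a healthy partition takes, and the assumption that a stall simply adds the full retry window on top are all illustrative guesses, not measurements from my upgrade.

```python
def stall_minutes(retries=5, interval_min=10):
    """Time one misbehaving AP holds up its partition while AOS retries."""
    return retries * interval_min


def upgrade_hours(partitions, healthy_min, stalled_fraction,
                  retries=5, interval_min=10):
    """Rough total upgrade time in hours.

    partitions       -- number of AP partitions (assumed, not from AOS)
    healthy_min      -- minutes a partition takes when pre-load works (assumed)
    stalled_fraction -- share of partitions with at least one stuck AP
    """
    stalled = round(partitions * stalled_fraction)
    total_min = partitions * healthy_min + stalled * stall_minutes(retries, interval_min)
    return total_min / 60


print(stall_minutes())               # retry window for one stuck AP, in minutes
print(upgrade_hours(20, 10, 0.0))    # every partition behaves
print(upgrade_hours(20, 10, 1.0))    # one stuck AP in every partition
```

The point is the multiplier rather than the exact figures: with these made-up inputs, a single stuck AP per partition turns an upgrade of a few hours into one that runs most of a day.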
So you have a choice: you can do a one-hit reboot of all the controllers into new firmware, just as we always did, or you can do a cluster upgrade. One is easy to communicate to people, the other might not need any comms at all… it depends on your network design. If you’re confident of cell overlap being really optimal, it perhaps doesn’t matter how long the upgrade takes because your users will hardly notice.