Network resilience

Every so often we experience a network outage because a piece of equipment fails. One switch we use across our campus has a power supply failure mode that trips the power, so one bad switch takes out everything. However, most of the time I’m impressed at just how resilient and reliable the kit is. Network switches in dirty, hot environments run reliably for years. In one case we had a switch whose fans had long since failed, in a room that used to reach 40°C. It finally fell over one hot summer’s day when the temperature hit 43°C. Even then it was fine once it had cooled down.

Just a bit damp

Most recently there was a water leak in a building. I say leak – a pressure vessel burst, so mains-pressure hot water poured through two floors of the building for a couple of hours.

Let’s not reflect on the building design that places active network equipment and the building power distribution boards next to the questionable plumbing but instead consider the life of this poor AP-105.

Having happily served clients for the past seven or eight years, it was time for a shower. It died. Not surprising. What’s perhaps more surprising is that, once dried out, the AP functioned perfectly well.

This isn’t the first time water damage has been a problem for us. Investigating a user complaint with a colleague, we once found a switch subject to such a long-term water leak that it had limescale deposits across the front and the pins in its sockets had corroded. It was in a sad way, but even though the cabinet resembled Mother Shipton’s cave, the switch was still online.

I have seen network equipment from Cisco, HP, Aruba, Ubiquiti and Extreme, all subject to quite serious abuse in conditions far outside the environmental specifications.

This isn’t to suggest we should be cavalier in our attitude towards deployment conditions – rather to celebrate the level of quality and reliability that’s achieved in the design and manufacturing of the equipment we use.

Farewell sweet AP-93H

Towards the end of 2018 we marked the final Aruba AP-125 being decommissioned. A venerable workhorse of an AP, these units provided excellent, reliable service for a really long time. Now it’s the turn of another stalwart of our network estate – the AP-93H.

Aruba AP-93H

The H apparently stands for “hospitality”, or so I’m told – I’ve never checked. These APs fit over the top of a network socket, and they have been invaluable to us.

Aruba are not alone in making devices in this format – Cisco, Ubiquiti and others do the same – and in each case they solve a really big problem. Namely, we didn’t put enough wires in.

We’re replacing the AP-93H with Aruba’s current model, the AP-303H, but it isn’t just bedrooms that get the hospitality treatment.

I’ve written before about our challenges with asbestos, but also the need to have an agile and responsive approach to fast changing network requirements in academic and research environments. The hospitality units are a fantastic way to expand network capacity where there’s no available ceiling level socket for one of our usual models, or maybe we’re not allowed to touch the ceiling anyway.

Stick an AP-303H over the top of a double socket and you can have Wi-Fi and four sockets available. Three of those network sockets can either tunnel that traffic back to the mobility controller or bridge locally to the switch – it’s up to you.
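On the controller side, that tunnel-or-bridge decision is made per downlink port via wired port profiles. A minimal sketch of the idea, assuming ArubaOS 8 CLI – all profile and group names here are invented for illustration, so check the syntax against your own AOS version:

```
! Forwarding behaviour is set in a wired-ap profile:
! "bridge" switches traffic locally, "tunnel" sends it to the controller.
wired-ap-profile "illustrative-bridge"
    forward-mode bridge
    wired-ap-enable
!
! Attach that to a wired port profile...
ap wired-port-profile "downlink-bridge"
    wired-ap-profile "illustrative-bridge"
!
! ...and apply it to one of the AP's downlink ports for an AP group.
ap-group "illustrative-group"
    enet1-port-profile "downlink-bridge"
```

Swapping `forward-mode bridge` for `forward-mode tunnel` sends that port’s traffic back to the mobility controller instead.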

The AP-93H has, like most of the other hardware we’ve replaced, served us well. They’re end of life and not supported past Aruba OS 8.2, so they have had to be retired. Although they’re dual band, these APs have only a single radio, so you can have either 2.4GHz or 5GHz, not both. So we welcome the upgraded version, perhaps wish the mounting plates were the same, and carry on with the upgrades.

Aruba OS8 cluster upgrade

I’m far from the first to share thoughts about such things, but I saw a live demo of an Aruba OS8 cluster being upgraded at the company’s Atmosphere event a couple of years ago. The controllers serving the conference were upgraded live, while we all checked our devices to confirm we were still online.

The live cluster upgrade is probably one of the biggest headline features of AOS8. There are others I particularly like, but that’s for another time. The process works best if you have a reasonable amount of cell overlap in your design.

First, all the clients and APs are moved away from one controller in the cluster, and that controller is upgraded. Once it comes back online and syncs up, it becomes a cluster master in a cluster of one. Then a group of APs (AOS calls this a partition) is selected; the aim is that none of the selected APs are neighbours of each other. The new firmware is pre-loaded onto the partition of APs. Next, AOS encourages clients to leave these APs, using all the ClientMatch tools it can, before the APs reboot. The aim is that clients will be able to associate with a neighbouring AP.
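The partition-selection step amounts to picking a set of APs that don’t neighbour each other, so their cells’ clients always have somewhere to roam. A toy sketch of that idea – the neighbour graph and function name are invented for illustration, not how AOS actually implements it:

```python
def select_partition(neighbours):
    """Greedily pick a set of APs such that no two are RF neighbours.

    neighbours: dict mapping AP name -> set of neighbouring AP names.
    Returns a set of APs that could safely reboot together.
    """
    partition = set()
    for ap in sorted(neighbours):  # deterministic order for the sketch
        # Only add this AP if none of its neighbours are already selected.
        if neighbours[ap].isdisjoint(partition):
            partition.add(ap)
    return partition

# Three APs in a row: ap1-ap2 are neighbours, ap2-ap3 are neighbours.
graph = {
    "ap1": {"ap2"},
    "ap2": {"ap1", "ap3"},
    "ap3": {"ap2"},
}
print(select_partition(graph))  # ap1 and ap3 reboot together; ap2 waits its turn
```

This is why the process works best with decent cell overlap: the more neighbours each AP has, the more somewhere-to-roam there is while a partition reboots.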

Once the upgraded APs come back up, they’ll join the upgraded controller, and so the process rumbles on. If you have multiple controllers, at some point AOS will upgrade the other controllers and expand the cluster.

For always-on networks like a university campus, hospital or airport, this is a great step forward, as it allows much more regular code upgrades. A good thing.

However – and there always has to be a downside, doesn’t there – it doesn’t always go quite as expected.

I performed a cluster upgrade from 8.3 to 8.4 and it took a long time – about 17 hours to upgrade 2500 APs and four controllers. APs can get into a state where the software pre-load doesn’t work. Rebooting the AP fixes this, but AOS doesn’t do that. Instead it retries the pre-load five times at 10-minute intervals, which can leave a single partition taking almost an hour. If you have one AP that doesn’t behave in each partition, the entire process drags out for a really long time. Aruba have acknowledged this is an issue, and I expect eventually there’ll be a fix or workaround.
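The back-of-the-envelope arithmetic shows how quickly those retries add up. A sketch, using the retry behaviour described above – the partition count is an illustrative assumption, not a figure from our deployment:

```python
# AOS retries a failed pre-load five times at 10-minute intervals
# before moving on, so one stuck AP stalls its whole partition.
RETRIES = 5
RETRY_INTERVAL_MIN = 10

worst_case_stall_min = RETRIES * RETRY_INTERVAL_MIN
print(worst_case_stall_min)  # 50 minutes of retrying, before the reboot itself

# Assume (purely for illustration) 20 partitions, each with one bad AP:
partitions = 20
total_stall_hours = partitions * worst_case_stall_min / 60
print(round(total_stall_hours, 1))  # ~16.7 hours lost to retries alone
```

One misbehaving AP per partition is all it takes to turn an overnight job into most of a day.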

So you have a choice – you can do a one-hit reboot of all the controllers onto the new firmware, just as we always did, or you can do a cluster upgrade. One is easy to communicate to people; the other might not need any comms at all… it depends on your network design. If you’re confident your cell overlap is really good, it perhaps doesn’t matter how long the upgrade takes, because your users will hardly notice.