What? No ARP?

In a lesson in just how important the LAN part of WLAN is, here’s a tale of a little problem that hit our network recently… We ran out of space in the ARP table of our core routers.

“What the hell?” I hear you say, “what is this, 2003?” and well you might ask.

Over the last couple of years we’ve been upgrading our campus network, which represents a very large number of switches, from primarily HP ProCurve 2600 series at the edge with 5400zl doing the OSPF routing, to primarily the Comware range: HPE 5130 at the edge, 5900 doing the OSPF, and 5930s at the core replacing the previous 5900s (set up in a terrible design that meant we could never upgrade them without bringing down everything). It’s a slow process because we try not to break things as we migrate them and, as I mentioned, it’s a lot of switches.

Our network isn’t really all that complicated, there’s just a lot of edge. So an IRF pair of 5930s as the core router in each datacentre seemed to be just fine.

Very recently we upgraded our Aruba mobility controllers and moved them off the 5400 switches that have been handling all our Wi-Fi traffic and on to some of the new kit in our datacentres.

So far so good.

We then slowly moved the subnets from the 5400s to our 5930s. This work was well over 50% complete when one morning, just near the start of the university term, we seemed to have a problem.

Some Wi-Fi users seemed to have no connectivity. We quickly established that plenty of traffic was flowing from the Wi-Fi controllers and through our off-site link. Whilst the problem seemed to be fairly serious, it wasn’t affecting hundreds of people, as far as we could tell.

Theories flew round the office as we tried to understand what was happening and why some users seemed to have a perfectly good Wi-Fi connection, with layer 2 traffic passing, but couldn’t do anything useful… There seemed to be no pattern, and a broken user would suddenly start working with nothing having changed.

It was spotted that we had 16,384 entries in the ARP table of our 5930s. This was initially dismissed as being a small number, but one of my brilliant colleagues pointed out that it was a rather neat, round number and that wasn’t likely to be a good thing.

It turns out that all the Comware switches we’re using as routers, the 5510, 5900 & 5930, have a maximum ARP table size of 16,384 entries.

As this term has kicked off we’ve seen client counts higher than that on our Wi-Fi alone, and the 5930s are also routing for all the servers in our datacentres.

This was a pretty basic problem. We’d just filled up the tables and our routing switches were no longer doing the business.

This issue caught us out because, as previously mentioned, 16,384 is quite a small number. The generally highly specced 5930 will handle 288K MAC addresses and a routing table of over 100K entries. More significantly, the decade-old switches we were replacing didn’t have this ARP table limitation.

Another reason this slipped past us is that ARP table size doesn’t appear on the spec sheets of many switches.
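Since the limit doesn’t show up on a spec sheet, the only way to stay ahead of it is to watch the number yourself. Here’s a minimal sketch of the sort of thing I mean, counting the entries in the standard IP-MIB ARP table over SNMP using Net-SNMP’s snmpwalk. The hostnames, community string and warning threshold are placeholders, and you’d want to check your own platform actually exposes this table.

```python
#!/usr/bin/env python3
"""Rough ARP table watcher: counts entries via SNMP and warns near the limit."""
import subprocess

ARP_OID = "1.3.6.1.2.1.4.22.1.2"   # IP-MIB ipNetToMediaPhysAddress (the ARP table)
LIMIT = 16384                      # hardware ARP table size on our routing switches
WARN_AT = 0.8                      # shout when the table is 80% full
ROUTERS = ["core-5930-a.example.net", "core-5930-b.example.net"]  # placeholder hostnames
COMMUNITY = "public"               # placeholder SNMPv2c community

for router in ROUTERS:
    out = subprocess.run(
        ["snmpwalk", "-v2c", "-c", COMMUNITY, "-On", router, ARP_OID],
        capture_output=True, text=True, check=True,
    )
    entries = len(out.stdout.splitlines())  # one line per ARP entry returned
    status = "WARN" if entries >= LIMIT * WARN_AT else "ok"
    print(f"{router}: {entries}/{LIMIT} ARP entries [{status}]")
```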

We just assumed these very capable datacentre switches had the horsepower and memory allocation to do what we needed, and assuming made an ass of whoever.

Cisco fans will tell you they’ve long had the ability to allocate the finite amount of memory in a layer 3 switch to the tables you need, balancing resources between L3 routes, MAC addresses and ARP entries. Fortunately this functionality is now being made available on the 5930.

In our case this means we can reduce the routing table size (I don’t think we need 10K routes, never mind 100K+) and give more room to ARP entries. We can then try again to move the Wi-Fi subnets and, hopefully, avoid problems.

The lesson from this has to be the importance of understanding the spec of a box you’re putting at the centre of a network. I can comment on the importance of assessing the implications of a change, such as moving the routing for 16K clients, but to be honest we’d still have assumed the big switches could handle this.

So, roll on the disruption of multiple reboots to bring our 5930s up to the software version that will do what we need, and in the meantime the venerable 5400zl continues to just work.

Finally, I should stress that our network is really very reliable. Serious outages used to be relatively commonplace, but I can’t remember when we last experienced an unexpected widespread outage. This, again, is why we were caught out by this one. The 5930s have just been rock solid and they spoiled us.

Audinate Dante, with lots of switches, and latency

I’m sure someone once said you shouldn’t use commas in a headline or title, but they’re not the boss of me and I don’t care. So, to business… Here’s one of the rare posts that isn’t about the WiFi.

As someone who’s done my fair share of live sound and studio audio work, I’m something of a gear head. One of the big developments in digital audio has been networked audio, and by far the biggest, and some would say only truly relevant, player is the Australian company Audinate with their product Dante.

For those that don’t know, Dante is a networked audio system that allows multiple unicast or multicast streams of high quality, uncompressed, sample-accurate audio to be routed between devices across a network. It’s based around custom chips that are designed and sold by Audinate, then used by pretty much every serious pro audio equipment manufacturer in the world.

Perhaps the best thing about Dante is that it’s just another network device. It doesn’t need anything special to work, and it will run over standard network switches. It’s perhaps no surprise that Dante has become really popular in AV installs. You can even get PoE-powered speakers, so rolling out a PA system through conference centre corridors can be done with just a network cable. It’s neat stuff.

But this post is less about Dante itself, and how awesome it is, and more about how it can impact design decisions in modern networks.

For this example I’m going to use a large lecture hall – a flexible space that can seat 1500 people and is used for conferences, lectures, public events, etc.

A key consideration in a digital audio system is latency – generally referring to how long it takes audio to make its way through the system from, say, a microphone input to the speaker output. The old analogue audio systems had effectively no latency: electrons being inconvenienced along the various mic cables, through the mixer, amplifiers and out to the speakers all happens extremely quickly. In digital systems we have analogue to digital conversion, which takes a bit of time, and we invariably introduce buffers, which hold things up and leave data waiting around. Almost every part of a digital audio system is slower than analogue. Add all these delays together and the system latency can easily get too long to be acceptable.
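To put rough numbers on just the buffering part, here’s a back-of-envelope sketch; the sample rate and buffer sizes are illustrative rather than taken from any particular system.

```python
# Back-of-envelope: how much latency a sample buffer adds at a given sample rate.
SAMPLE_RATE = 48_000  # Hz, typical for live digital audio

for samples in (16, 64, 256, 1024):
    latency_ms = samples / SAMPLE_RATE * 1000
    print(f"{samples:>5} sample buffer -> {latency_ms:.2f} ms")

# 16 samples  ->  0.33 ms
# 64 samples  ->  1.33 ms
# 256 samples ->  5.33 ms
# and every buffer in the chain (ADC, DSP, network, DAC) adds its own chunk
```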

The same can be said about networking. Every network device your frame or packet passes through will add a bit of latency.

Dante moves audio across an Ethernet network, and switched Ethernet is really very fast (worth noting that a standard Dante network doesn’t route across subnets). Dante conceptually supports incredibly low latency settings, but most of the chipsets have a minimum latency setting of 1ms. This means anything receiving a stream from that device will allocate a 1ms buffer and will therefore handle network latency of up to 1ms without any disruption to the audio. If the latency exceeds 1ms then you’re in trouble.

A brief blip might be barely noticeable, but sustained high latency will result in late packets being dropped and a particularly unpleasant-sounding audio stream.
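As a toy illustration (with made-up numbers, nothing measured), this is all the receive buffer is really doing: a packet is played out a fixed time after it was sent, and anything arriving after that misses its slot.

```python
# Toy model of a Dante-style receive buffer: a packet sent at time t is played
# out at t + latency setting; anything arriving later than that is dropped.
LATENCY_SETTING_US = 1000  # the 1ms minimum on many chipsets

# (send_time_us, network_delay_us) - illustrative numbers only
packets = [(0, 90), (333, 120), (666, 400), (1000, 1250), (1333, 95)]

for sent, delay in packets:
    deadline = sent + LATENCY_SETTING_US
    arrived = sent + delay
    verdict = "played" if arrived <= deadline else "DROPPED (late)"
    print(f"sent {sent:>4}us, arrived {arrived:>4}us, deadline {deadline:>4}us: {verdict}")
```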

Having discussed latency, and network switches, let’s get to the core of the issue…

Recently installed Dante equipment wasn’t working very well in the lecture hall. The network switches are HPE 5130s with IRF stacking. There are three switch stacks in the building, connected back to an IRF pair of HPE 5900 switches (acting as the router) with 2 x 10Gb links from each stack in a LACP. Stack 1 has two switches, stack 2 has five and stack 3 has nine.

From an administrative perspective there are three switches in the building, though of course we have to remember there are really far more… Why? Because latency, that’s why.

Our problem Dante devices are situated in mobile racks, each containing its own switch – this allows them to be easily unplugged and moved around.

It turns out one mobile rack was connected to stack 2 and the other to stack 3. With 1ms latency configured, Audinate reckon Dante is good for ten gigabit switches or about three 100Mb switches (because Fast Ethernet has higher latency), but we were having problems, so just how many switches was our traffic passing through?

The trouble is, I don’t know. Because our Comware switches are in IRF stacks, how traffic moves between members is inside the black box and not available for analysis. Our stacks are a ring topology and we have two uplinks; I don’t know which way around the ring the traffic is going or which of the two uplinks it’s using.

So I have to assume a worst case scenario, even if this shouldn’t actually happen. Here goes: my first mobile rack switch is connected to stack 2 member 1 and the stack hashes my traffic to the second uplink, which hangs off the member at the far end of the stack. For some reason, rather than taking the shortest route, my traffic makes its way the long way around the ring, therefore passing through all five switches in the stack.

It then heads to the routing switch, also an IRF pair, so let’s assume we pass through both of them and on towards stack 3. Again we have to assume the worst case scenario: that the traffic passes through all nine switches.

The end result is that it’s possible the traffic could traverse 18 switches.

It’s hard to put firm figures on this. According to the HPE 5130 datasheet the gigabit latency is <5us. What I don’t know is whether that changes as more complex configs are applied. 10Gb latency is an even lower <3us. However, we don’t know how, or even if, IRF affects this. These switches are blisteringly, mind-bendingly fast. Even a much cheaper switch is really, really quick, and it’s why we can build switched networks capable of incredible performance. All hail the inventor of the network switch, whose name I can’t be bothered to look up.

Even that blisteringly fast performance times 18 adds up, though not to much. If our Dante traffic follows the worst case scenario, however unlikely, we still only get to something like 200us of latency. This should be absolutely fine, but it wasn’t reliable. I suspect a big part of this is general network conditions: the more physical switches your Dante traffic passes through, the higher the chances it faces moments of congestion.
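For what it’s worth, here’s that worst-case arithmetic written down, using the datasheet’s per-hop figure and an assumed fudge factor for serialisation and IRF overhead (the factor is a guess, not a measurement).

```python
# Worst-case path from the text: mobile rack switch -> all five members of
# stack 2 -> both 5900 routers -> all nine members of stack 3 -> the far
# mobile rack switch.
hops = 1 + 5 + 2 + 9 + 1          # = 18 switches
per_hop_us = 5                    # <5us per hop, from the HPE 5130 gigabit figure
overhead_factor = 2               # rough allowance for serialisation/IRF overhead (assumed)
budget_us = 1000                  # the 1ms Dante receive buffer

worst_case_us = hops * per_hop_us * overhead_factor
print(f"{hops} hops x {per_hop_us}us x {overhead_factor} = ~{worst_case_us}us "
      f"against a {budget_us}us budget")
# ~180us against 1000us - comfortably inside the budget, on paper at least
```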

Back to reality then. Increasing the latency setting of the Dante device would remedy this at the cost of increasing the total audio system latency. If that’s acceptable, it’s the best choice because it provides a healthy margin. The other alternative is to change the network patching so we limit the number of switches. In this case we’ve been able to keep the critical devices connected to stack 3, so the very worst case scenario is down to 11 switches without touching the uplinks, and in fact most of the Dante devices are connected to the same stack member.

At the end of this long, confusing mess of writing, then, I’m left uncertain quite what was causing the issue. Our theoretical latency was well below the maximum the receive buffer could absorb and yet we ran into issues. Probably the most important takeaway for a network engineer is that when working with something as latency sensitive and uncompromising as Dante, it’s important to consider the number of devices in the network path… you know, just like the deployment guide says.

Aruba Control Plane Security and the AP-203H

Here’s a useful little tidbit for anyone with an Aruba OS 6.5 environment wanting to enable control plane security with some AP-203H units on the network.

(EDIT: This also applies to AOS 8, where CPsec is now enabled by default. Whilst the AP-203H is quite dated now, there may be some still around and you will experience these issues when migrating from AOS 6/6.5 to AOS 8.)

With Aruba campus APs, all traffic between users and the controller is encrypted (probably), because the Wi-Fi encryption in use runs between the client and the mobility controller rather than being decrypted at the AP. So unless you’re using an open network, all your client traffic is encrypted until it pops out of the controller. Lovely.

Control plane traffic, the AP talking to the controller, is not. This isn’t usually a problem as we mostly trust the wires (whether we should or not is another matter).

In many Aruba controller environments all the user traffic is tunneled back to the controller in our data centre, which works very well especially when users are highly mobile around a campus.

However it’s also possible to bridge an SSID so the AP drops the traffic onto a local VLAN. This is desirable in some circumstances, one real world example being a robotics research lab needing to connect devices to a local subnet. In order to enable this you first have to switch on control plane security.

Switching CPsec on causes all APs to reboot at least twice, so on an existing deployment it leads to 8-15 minutes of downtime. I switched CPsec on this morning for our network of approximately 2,500 APs; it was a tense time but went well… mostly.

I’d read stories of APs with a bad trusted platform module being unable to share their certificate with the controller. We have a lot of APs from many years old to brand new and even a low percentage failure rate would present a problem.

In the end two APs failed, with lots of TPM initialization errors. However all our AP-203H units failed to come back up and the controller process logs started showing things like this:

Sep 4 07:57:37 nanny[1399]: <303022> <WARN> |AP <apname>@<ip_addr> nanny| Reboot Reason: AP rebooted Wed Dec 31 16:01:41 PST 1969; Unable to set up IPSec tunnel to saved lms, Error:RC_ERROR_CPSEC_CERT_REJECTED

It took a little while to become clear that just one model of AP was affected – and that it was every single one of them.
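If you hit the same thing, a quick way to see which APs are being rejected is to pull the names out of the nanny log lines (based on the sample format above) and compare them against your AP inventory. A rough sketch, where the log file path is a placeholder:

```python
#!/usr/bin/env python3
"""List the AP name/IP from every nanny log line carrying the CPsec certificate
rejection, so the result can be checked against an AP inventory."""
import re
import sys
from collections import Counter

PATTERN = re.compile(r"\|AP (?P<ap>\S+)@(?P<ip>\S+) nanny\|.*RC_ERROR_CPSEC_CERT_REJECTED")

rejected = Counter()
logpath = sys.argv[1] if len(sys.argv) > 1 else "controller-process.log"  # placeholder
with open(logpath) as logfile:
    for line in logfile:
        match = PATTERN.search(line)
        if match:
            rejected[(match["ap"], match["ip"])] += 1

for (ap, ip), count in sorted(rejected.items()):
    print(f"{ap:30} {ip:15} rejected {count} time(s)")
```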

A bit of time with Aruba TAC later and it transpires it’s necessary to execute the command: crypto-local pki allow-low-assurance-devices 

This needs to be run on each controller separately, and saved of course.

The command is described as allowing non-TPM devices to connect to the controller. I’m not entirely clear what’s special about the AP-203H; presumably it doesn’t have a TPM, yet it does have a factory installed certificate. We also have some AP-103H units, which don’t have a TPM, but they work just fine with a switch-cert. I suspect this is a firmware bug and Aruba OS is treating the 203H as if it has a TPM, but as it doesn’t, everything falls apart.

Clearly, if you want the maximum possible security, allowing low-assurance devices is going to raise eyebrows. In our case we’re happy with this limitation.

May this post find someone else who one day searches for RC_ERROR_CPSEC_CERT_REJECTED ap-203h 

Where’s the data?

On my quest to learn, this week I had the privilege of attending Peter Mackenzie’s Certified Wireless Analysis Professional class. When a colleague and I attended Peter’s CWNA class a couple of years ago he suggested that CWAP should be the next port of call after passing the CWNA exam. Initially I thought that was mainly because he wrote the book (and an excellent book it is too) but actually CWAP goes deeper into the protocol and builds an even better foundation for understanding how WiFi works.

Most people who’ve dabbled in networking are familiar with Wireshark. It’s a fantastic tool, and a great way of troubleshooting problems. With a packet capture you can prove that a client isn’t getting an IP address because it isn’t doing DHCP properly, if at all, rather than it being a wider network issue…
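As a hedged example of that kind of proof, here’s a sketch that walks a capture with scapy and lists the DHCP message types seen per transaction; the filename is a placeholder and it assumes the capture actually contains the DHCP exchange (or lack of one).

```python
"""Rough check of who is (and isn't) doing DHCP in a capture."""
from collections import defaultdict
from scapy.all import rdpcap, BOOTP, DHCP  # scapy parses DHCP options on top of BOOTP

MSG_TYPES = {1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 4: "DECLINE",
             5: "ACK", 6: "NAK", 7: "RELEASE", 8: "INFORM"}

exchanges = defaultdict(list)
for pkt in rdpcap("dhcp.pcap"):          # placeholder capture file
    if pkt.haslayer(DHCP):
        xid = pkt[BOOTP].xid             # group messages by DHCP transaction ID
        for opt in pkt[DHCP].options:
            if isinstance(opt, tuple) and opt[0] == "message-type":
                exchanges[xid].append(MSG_TYPES.get(opt[1], str(opt[1])))

for xid, msgs in exchanges.items():
    print(f"xid 0x{xid:08x}: {' -> '.join(msgs)}")
# A healthy exchange reads DISCOVER -> OFFER -> REQUEST -> ACK;
# a client stuck at DISCOVER (or silent) points at the client, not the wider network.
```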

Wired packet capture is usually easy. Mirror a port on a switch, put an actual honest-to-goodness old school hub in line (if you have one and can tolerate the lower bandwidth during capture), or if you have the resources get a fancy tap or dedicated capture device such as the Fluke OneTouch. Usually we have one of these methods available, but for Wi-Fi it’s not quite so easy.

Mac users have these fantastic tools available, and there are good options for Linux users but for Windows folk life can be tough and expensive.

Wi-Fi drivers for Windows tend not to expose the necessary level of control to applications. So you’re left needing a wireless NIC with a specific chipset for which there’s a magic driver, or getting hold of an expensive dedicated capture NIC.

There is one option that I’ve played with, in the form of Tarlogic’s Acrylic WiFi. This affordable software includes an NDIS driver that interfaces with the Wi-Fi NIC and presents monitor mode data to applications. Analysis within Acrylic itself is fairly poor, but it will save pcap files and it’s possible to use the Acrylic driver directly in Wireshark.

The problem is that many new drivers don’t interface with NDIS in Windows the way they used to, so there’s a shrinking number of Wi-Fi NICs that still work.

Some time ago I bought an Asus AC53 USB NIC, which is on the Acrylic supported list, and it is possible to install the old Windows 8.1 drivers on Windows 10. However this doesn’t support DFS channels, which is a problem because we use DFS channels.

Fear not though, it is, just about, possible to make this work. The Netgear A6200 uses the same Broadcom chipset and supports DFS channels. Once its driver was installed I was able to select the Netgear driver for the Asus device and, sure enough, it works.

Which is a long lead-in to this little tidbit, a key takeaway from Peter’s course this week.

When looking at a wireless capture it’s important to remember you might not be seeing all the information. Physical headers are stripped off by the hardware of the Wi-Fi interface, so you can’t see those. Wi-Fi does certain things with physical headers only, such as MU-MIMO channel sounding, and aspects of this are therefore not visible to a protocol analyser.

It probably goes without saying that you can’t see encrypted data, but you can usually see that it’s been transmitted… Unless your wireless interface can’t decode it.

Let’s say, for example, your AP is using a 40MHz DFS channel and the capture setup you’re using can’t be configured for 40MHz channels. In this scenario, because management and action frames are all transmitted on the primary 20MHz channel, you can see these just fine, yet all the higher rate data frames that take advantage of the full 40MHz just disappear.

The result looks a bit like the picture here… A client sends RTS, the AP responds CTS and the next thing is a block acknowledgement but no data.

It’s sometimes possible to see the data transfer betrayed in the time between the packets but here, because the data rate is high and the traffic is light, it’s not particularly apparent.
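If you want something more concrete than squinting at timestamps, a rough sketch like this (using scapy, with a placeholder filename) tallies frame types in a monitor-mode capture; a pile of RTS/CTS and Block Acks with barely any data frames is the giveaway that the data is going past on a width your rig can’t decode.

```python
"""Tally 802.11 frame types in a monitor-mode capture."""
from collections import Counter
from scapy.all import rdpcap, Dot11

# Control frame subtypes of interest (802.11 type 1)
CONTROL_SUBTYPES = {8: "BlockAckReq", 9: "BlockAck", 11: "RTS", 12: "CTS"}

counts = Counter()
for pkt in rdpcap("capture.pcap"):       # placeholder monitor-mode capture
    if not pkt.haslayer(Dot11):
        continue
    dot11 = pkt[Dot11]
    if dot11.type == 0:                  # management frames
        counts["management"] += 1
    elif dot11.type == 1:                # control frames
        counts[CONTROL_SUBTYPES.get(dot11.subtype, "other control")] += 1
    elif dot11.type == 2:                # data frames
        counts["data"] += 1

for frame_type, count in counts.most_common():
    print(f"{frame_type:15} {count}")
```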

I’m particularly looking forward to spending more time digging into protocol analysis and hopefully getting some better tools.