What? No ARP?

In a lesson in just how important the LAN part of WLAN is, here’s a tale of a little problem that hit our network recently… We ran out of space in the ARP table of our core routers.

“What the hell?” I hear you say, “what is this, 2003?” and well you might ask.

Over the last couple of years we’ve been upgrading our campus network, which represents a very large number of switches, from primarily HP ProCurve 2600 series at the edge with 5400zl doing the OSPF routing, to primarily the Comware range: HPE 5130 at the edge, 5900 doing the OSPF, and 5930s at the core replacing the previous 5900s (set up in a terrible design that meant we could never upgrade them without bringing down everything). It’s a slow process because we try not to break things as we migrate them and, as I mentioned, it’s a lot of switches.

Our network isn’t really all that complicated, there’s just a lot of edge. So an IRF pair of 5930s as the core router in each datacentre seemed to be just fine.

Very recently we upgraded our Aruba mobility controllers and moved them off the 5400 switches that have been handling all our Wi-Fi traffic and on to some of the new kit in our datacentres.

So far so good.

We then slowly moved the subnets from the 5400s to our 5930s. This work was well over 50% complete when one morning, just near the start of the university term, we seemed to have a problem.

Some Wi-Fi users seemed to have no connectivity. We quickly established that plenty of traffic was flowing from the Wi-Fi controllers and through our off-site link. Whilst the problem seemed to be fairly serious, it wasn’t affecting hundreds of people, as far as we could tell.

Theories flew round the office as we tried to understand what was happening and why some users seemed to have a perfectly good Wi-Fi connection, with layer 2 traffic passing, but couldn’t do anything useful… There seemed to be no pattern, and a broken user would suddenly start working with nothing having changed.

It was spotted that we had 16,384 entries in the ARP table of our 5930s. This was initially dismissed as being a small number, but one of my brilliant colleagues pointed out that it was a rather neat, round number and that wasn’t likely to be a good thing.

It turns out that all the Comware switches we’re using as routers, the 5510, 5900 & 5930, have a maximum ARP table size of 16,384 entries.

As this term has kicked off we’ve seen client counts higher than that on our Wi-Fi alone, and the 5930s are also routing for all the servers in our datacentres.

This was a pretty basic problem. We’d just filled up the tables and our routing switches were no longer doing the business.

This issue caught us out because, as previously mentioned, 16,384 is quite a small number. The generally highly specced 5930 will handle 288K MAC addresses and a routing table of over 100K entries. More significantly, the decade-old switches we were replacing didn’t have this ARP table limitation.

Another reason this slipped past us is that ARP table size doesn’t appear on the spec sheets of many switches.
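Since the limit doesn’t show up on a spec sheet, the only way to stay ahead of it is to watch the number yourself. Here’s a minimal sketch of the sort of thing I mean, counting the entries in the standard IP-MIB ARP table over SNMP using Net-SNMP’s snmpwalk. The hostnames, community string and warning threshold are placeholders, and you’d want to check your own platform actually exposes this table.

```python
#!/usr/bin/env python3
"""Rough ARP table watcher: counts entries via SNMP and warns near the limit."""
import subprocess

ARP_OID = "1.3.6.1.2.1.4.22.1.2"   # IP-MIB ipNetToMediaPhysAddress (the ARP table)
LIMIT = 16384                      # hardware ARP table size on our routing switches
WARN_AT = 0.8                      # shout when the table is 80% full
ROUTERS = ["core-5930-a.example.net", "core-5930-b.example.net"]  # placeholder hostnames
COMMUNITY = "public"               # placeholder SNMPv2c community

for router in ROUTERS:
    out = subprocess.run(
        ["snmpwalk", "-v2c", "-c", COMMUNITY, "-On", router, ARP_OID],
        capture_output=True, text=True, check=True,
    )
    entries = len(out.stdout.splitlines())  # one line per ARP entry returned
    status = "WARN" if entries >= LIMIT * WARN_AT else "ok"
    print(f"{router}: {entries}/{LIMIT} ARP entries [{status}]")
```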

We just assumed these very capable datacentre switches had the horsepower and memory allocation to do what we needed, and assuming made an ass of whoever.

Cisco fans will tell you they’ve long had the ability to allocate the finite amount of memory in a layer 3 switch to the tables you need, balancing resources between L3 routes, MAC addresses and ARP entries. Fortunately this functionality is now being made available on the 5930.

In our case this means we can reduce the routing table size (I don’t think we need 10K routes, never mind 100K+) and give more room to ARP entries. We can then try again to move the Wi-Fi subnets and, hopefully, avoid problems.

The lesson from this has to be the importance of understanding the spec of a box you’re putting at the centre of a network. I can comment on the importance of assessing the implications of a change, such as moving the routing for 16K clients, but to be honest we’d still have assumed the big switches could handle this.

So, roll on the disruption of multiple reboots to bring our 5930s up to the software version that will do what we need, and in the meantime the venerable 5400zl continues to just work.

Finally, I should stress that our network is really very reliable. Serious outages used to be relatively commonplace, but I can’t remember when we last experienced an unexpected widespread outage. This, again, is why we were caught out by this one. The 5930s have just been rock solid and they spoiled us.

Audinate Dante, with lots of switches, and latency

I’m sure someone once said you shouldn’t use commas in a headline or title, but they’re not the boss of me and I don’t care. So, to business… Here’s one of the rare posts that isn’t about the WiFi.

As someone who’s done my fair share of live sound and studio audio work, I’m something of a gear head. One of the big developments in digital audio has been networked audio, and by far the biggest, and some would say only truly relevant, player is the Australian company Audinate with their product Dante.

For those that don’t know, Dante is a networked audio system that allows multiple unicast or multicast streams of high quality, uncompressed, sample-accurate audio to be routed between devices across a network. It’s based around custom chips that are designed and sold by Audinate, then used by pretty much every serious pro audio equipment manufacturer in the world.

Perhaps the best thing about Dante is that it’s just another network device. It doesn’t need anything special to work, and it will run over standard network switches. It’s perhaps no surprise that Dante has become really popular in AV installs. You can even get PoE-powered speakers, so rolling out a PA system through conference centre corridors can be done with just a network cable. It’s neat stuff.

But this post is less about Dante itself, and how awesome it is, and more about how it can impact design decisions in modern networks.

For this example I’m going to use a large lecture hall – a flexible space that can seat 1500 people and is used for conferences, lectures, public events, etc.

A key consideration in a digital audio system is latency – generally referring to how long it takes audio to make its way through the system from, say, a microphone input to the speaker output. The old analogue audio systems had effectively no latency: electrons being inconvenienced along the various mic cables, through the mixer, amplifiers and out to the speakers all happens extremely quickly. In digital systems we have analogue to digital conversion, which takes a bit of time, and we invariably introduce buffers, which hold things up and leave data waiting around. Almost every part of a digital audio system is slower than analogue. Add all these delays together and the system latency can easily get too long to be acceptable.
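To put rough numbers on just the buffering part, here’s a back-of-envelope sketch; the sample rate and buffer sizes are illustrative rather than taken from any particular system.

```python
# Back-of-envelope: how much latency a sample buffer adds at a given sample rate.
SAMPLE_RATE = 48_000  # Hz, typical for live digital audio

for samples in (16, 64, 256, 1024):
    latency_ms = samples / SAMPLE_RATE * 1000
    print(f"{samples:>5} sample buffer -> {latency_ms:.2f} ms")

# 16 samples  ->  0.33 ms
# 64 samples  ->  1.33 ms
# 256 samples ->  5.33 ms
# and every buffer in the chain (ADC, DSP, network, DAC) adds its own chunk
```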

The same can be said about networking. Every network device your frame or packet passes through will add a bit of latency.

Dante moves audio across an Ethernet network, and switched Ethernet is really very fast (worth noting that a standard Dante network doesn’t route across subnets). Dante conceptually supports incredibly low latency settings, but most of the chipsets have a minimum latency setting of 1ms. This means anything receiving a stream from that device will allocate a 1ms buffer and will therefore handle network latency of up to 1ms without any disruption to the audio. If the latency exceeds 1ms then you’re in trouble.

A brief blip might be barely noticeable, but sustained high latency will result in late packets being dropped and a particularly unpleasant-sounding audio stream.
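As a toy illustration (with made-up numbers, nothing measured), this is all the receive buffer is really doing: a packet is played out a fixed time after it was sent, and anything arriving after that misses its slot.

```python
# Toy model of a Dante-style receive buffer: a packet sent at time t is played
# out at t + latency setting; anything arriving later than that is dropped.
LATENCY_SETTING_US = 1000  # the 1ms minimum on many chipsets

# (send_time_us, network_delay_us) - illustrative numbers only
packets = [(0, 90), (333, 120), (666, 400), (1000, 1250), (1333, 95)]

for sent, delay in packets:
    deadline = sent + LATENCY_SETTING_US
    arrived = sent + delay
    verdict = "played" if arrived <= deadline else "DROPPED (late)"
    print(f"sent {sent:>4}us, arrived {arrived:>4}us, deadline {deadline:>4}us: {verdict}")
```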

Having discussed latency, and network switches, let’s get to the core of the issue…

Recently installed Dante equipment wasn’t working very well in the lecture hall. The network switches are HPE 5130s with IRF stacking. There are three switch stacks in the building, connected back to an IRF pair of HPE 5900 switches (acting as the router) with 2 x 10Gb links from each stack in a LACP. Stack 1 has two switches, stack 2 has five and stack 3 has nine.

From an administrative perspective there are three switches in the building, though of course we have to remember there are really far more… Why? Because latency, that’s why.

Our problem Dante devices are situated in mobile racks, each containing its own switch – this allows them to be easily unplugged and moved around.

It turns out one mobile rack was connected to stack 2 and the other to stack 3. With 1ms latency configured, Audinate reckon Dante is good for ten gigabit switches or about three 100Mb switches (because Fast Ethernet has higher latency), but we were having problems, so just how many switches was our traffic passing through?

The trouble is, I don’t know. Because our Comware switches are in IRF stacks, how traffic moves between members is inside the black box and not available for analysis. Our stacks are a ring topology and we have two uplinks; I don’t know which way around the ring the traffic is going or which of the two uplinks it’s using.

So I have to assume a worst case scenario, even if this shouldn’t actually happen. Here goes: my first mobile rack switch is connected to stack 2 member 1 and the stack hashes my traffic to the second uplink, which hangs off the member at the far end of the stack. For some reason, rather than taking the shortest route, my traffic makes its way the long way around the ring, therefore passing through all five switches in the stack.

It then heads to the routing switch, also an IRF pair, so let’s assume we pass through both of them and on towards stack 3. Again we have to assume the worst case scenario: that the traffic passes through all nine switches.

The end result is that it’s possible the traffic could traverse 18 switches.

It’s hard to put firm figures on this. According to the HPE 5130 datasheet the gigabit latency is <5us. What I don’t know is whether that changes as more complex configs are applied. 10Gb latency is an even lower <3us. However, we don’t know how, or even if, IRF affects this. These switches are blisteringly, mind-bendingly fast. Even a much cheaper switch is really, really quick, and it’s why we can build switched networks capable of incredible performance. All hail the inventor of the network switch, whose name I can’t be bothered to look up.

Even that blisteringly fast performance times 18 adds up, though not to much. If our Dante traffic follows the worst case scenario, however unlikely, we still only get to something like 200us of latency. This should be absolutely fine, but it wasn’t reliable. I suspect a big part of this is general network conditions: the more physical switches your Dante traffic passes through, the higher the chances it faces moments of congestion.
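For what it’s worth, here’s that worst-case arithmetic written down, using the datasheet’s per-hop figure and an assumed fudge factor for serialisation and IRF overhead (the factor is a guess, not a measurement).

```python
# Worst-case path from the text: mobile rack switch -> all five members of
# stack 2 -> both 5900 routers -> all nine members of stack 3 -> the far
# mobile rack switch.
hops = 1 + 5 + 2 + 9 + 1          # = 18 switches
per_hop_us = 5                    # <5us per hop, from the HPE 5130 gigabit figure
overhead_factor = 2               # rough allowance for serialisation/IRF overhead (assumed)
budget_us = 1000                  # the 1ms Dante receive buffer

worst_case_us = hops * per_hop_us * overhead_factor
print(f"{hops} hops x {per_hop_us}us x {overhead_factor} = ~{worst_case_us}us "
      f"against a {budget_us}us budget")
# ~180us against 1000us - comfortably inside the budget, on paper at least
```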

Back to reality then. Increasing the latency setting of the Dante device would remedy this at the cost of increasing the total audio system latency. If that’s acceptable, it’s the best choice because it provides a healthy margin. The other alternative is to change the network patching so we limit the number of switches. In this case we’ve been able to keep the critical devices connected to stack 3, so the very worst case scenario is down to 11 switches without touching the uplinks, and in fact most of the Dante devices are connected to the same stack member.

At the end of this long, confusing mess of writing, then, I’m left uncertain quite what was causing the issue. Our theoretical latency was well below the maximum the receive buffer could absorb and yet we ran into issues. Probably the most important takeaway for a network engineer is that when working with something as latency sensitive and uncompromising as Dante, it’s important to consider the number of devices in the network path… you know, just like the deployment guide says.

Aruba Control Plane Security and the AP-203H

Here’s a useful little tidbit for anyone with an Aruba OS 6.5 environment wanting to enable control plane security with some AP-203H units on the network.

(EDIT: This also applies to AOS 8, where CPsec is now enabled by default. Whilst the AP-203H is quite dated now, there may be some still around and you will experience these issues when migrating from AOS 6/6.5 to AOS 8.)

With Aruba campus APs, all traffic between users and the controller is encrypted (probably), because the Wi-Fi encryption in use runs between the client and the mobility controller rather than being decrypted at the AP. So unless you’re using an open network, all your client traffic is encrypted until it pops out of the controller. Lovely.

Control plane traffic, the AP talking to the controller, is not. This isn’t usually a problem as we mostly trust the wires (whether we should or not is another matter).

In many Aruba controller environments all the user traffic is tunneled back to the controller in our data centre, which works very well especially when users are highly mobile around a campus.

However it’s also possible to bridge an SSID so the AP drops the traffic onto a local VLAN. This is desirable in some circumstances, one real world example being a robotics research lab needing to connect devices to a local subnet. In order to enable this you first have to switch on control plane security.

Switching CPsec on causes all APs to reboot at least twice, so on an existing deployment it leads to 8-15 minutes of downtime. I switched CPsec on this morning for our network of approximately 2,500 APs; it was a tense time but went well… mostly.

I’d read stories of APs with a bad trusted platform module being unable to share their certificate with the controller. We have a lot of APs from many years old to brand new and even a low percentage failure rate would present a problem.

In the end two APs failed, with lots of TPM initialization errors. However all our AP-203H units failed to come back up and the controller process logs started showing things like this:

Sep 4 07:57:37 nanny[1399]: <303022> <WARN> |AP <apname>@<ip_addr> nanny| Reboot Reason: AP rebooted Wed Dec 31 16:01:41 PST 1969; Unable to set up IPSec tunnel to saved lms, Error:RC_ERROR_CPSEC_CERT_REJECTED

It took a little while to become clear that just one model of AP was affected – and that it was every single one of them.
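If you hit the same thing, a quick way to see which APs are being rejected is to pull the names out of the nanny log lines (based on the sample format above) and compare them against your AP inventory. A rough sketch, where the log file path is a placeholder:

```python
#!/usr/bin/env python3
"""List the AP name/IP from every nanny log line carrying the CPsec certificate
rejection, so the result can be checked against an AP inventory."""
import re
import sys
from collections import Counter

PATTERN = re.compile(r"\|AP (?P<ap>\S+)@(?P<ip>\S+) nanny\|.*RC_ERROR_CPSEC_CERT_REJECTED")

rejected = Counter()
logpath = sys.argv[1] if len(sys.argv) > 1 else "controller-process.log"  # placeholder
with open(logpath) as logfile:
    for line in logfile:
        match = PATTERN.search(line)
        if match:
            rejected[(match["ap"], match["ip"])] += 1

for (ap, ip), count in sorted(rejected.items()):
    print(f"{ap:30} {ip:15} rejected {count} time(s)")
```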

A bit of time with Aruba TAC later and it transpires it’s necessary to execute the command: crypto-local pki allow-low-assurance-devices 

This needs to be run on each controller separately, and saved of course.

The command is described as allowing non-TPM devices to connect to the controller. I’m not entirely clear what’s special about the AP-203H; presumably it doesn’t have a TPM, yet it does have a factory installed certificate. We also have some AP-103H units, which don’t have a TPM, but they work just fine with a switch-cert. I suspect this is a firmware bug and Aruba OS is treating the 203H as if it has a TPM, but as it doesn’t, everything falls apart.

Clearly, if you want the maximum possible security, allowing low-assurance devices is going to raise eyebrows. In our case we’re happy with this limitation.

May this post find someone else who one day searches for RC_ERROR_CPSEC_CERT_REJECTED ap-203h 

Where’s the data?

On my quest to learn, this week I had the privilege of attending Peter Mackenzie’s Certified Wireless Analysis Professional class. When a colleague and I attended Peter’s CWNA class a couple of years ago he suggested that CWAP should be the next port of call after passing the CWNA exam. Initially I thought that was mainly because he wrote the book (and an excellent book it is too) but actually CWAP goes deeper into the protocol and builds an even better foundation for understanding how WiFi works.

Most people who’ve dabbled in networking are familiar with Wireshark. It’s a fantastic tool, and a great way of troubleshooting problems. With a packet capture you can prove that a client isn’t getting an IP address because it isn’t doing DHCP properly, if at all, rather than it being a wider network issue…
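As a hedged example of that kind of proof, here’s a sketch that walks a capture with scapy and lists the DHCP message types seen per transaction; the filename is a placeholder and it assumes the capture actually contains the DHCP exchange (or lack of one).

```python
"""Rough check of who is (and isn't) doing DHCP in a capture."""
from collections import defaultdict
from scapy.all import rdpcap, BOOTP, DHCP  # scapy parses DHCP options on top of BOOTP

MSG_TYPES = {1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 4: "DECLINE",
             5: "ACK", 6: "NAK", 7: "RELEASE", 8: "INFORM"}

exchanges = defaultdict(list)
for pkt in rdpcap("dhcp.pcap"):          # placeholder capture file
    if pkt.haslayer(DHCP):
        xid = pkt[BOOTP].xid             # group messages by DHCP transaction ID
        for opt in pkt[DHCP].options:
            if isinstance(opt, tuple) and opt[0] == "message-type":
                exchanges[xid].append(MSG_TYPES.get(opt[1], str(opt[1])))

for xid, msgs in exchanges.items():
    print(f"xid 0x{xid:08x}: {' -> '.join(msgs)}")
# A healthy exchange reads DISCOVER -> OFFER -> REQUEST -> ACK;
# a client stuck at DISCOVER (or silent) points at the client, not the wider network.
```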

Wired packet capture is usually easy. Mirror a port on a switch, put an actual honest-to-goodness old school hub in line (if you have one and can tolerate the lower bandwidth during capture), or if you have the resources get a fancy tap or dedicated capture device such as the Fluke OneTouch. Usually we have one of these methods available, but for Wi-Fi it’s not quite so easy.

Mac users have these fantastic tools available, and there are good options for Linux users but for Windows folk life can be tough and expensive.

Wi-Fi drivers for Windows tend not to expose the necessary level of control to applications. So you’re left needing a wireless NIC with a specific chipset for which there’s a magic driver, or getting hold of an expensive dedicated capture NIC.

There is one option that I’ve played with, in the form of Tarlogic’s Acrylic WiFi. This affordable software includes an NDIS driver that interfaces with the Wi-Fi NIC and presents monitor mode data to applications. Analysis within Acrylic itself is fairly poor, but it will save pcap files and it’s possible to use the Acrylic driver directly in Wireshark.

The problem is that many new drivers don’t interface with NDIS in Windows the way they used to, so there’s a shrinking number of Wi-Fi NICs that still work.

Some time ago I bought an Asus AC53 USB NIC, which is on the Acrylic supported list, and it is possible to install the old Windows 8.1 drivers on Windows 10. However this doesn’t support DFS channels, which is a problem because we use DFS channels.

Fear not though, it is, just about, possible to make this work. The Netgear A6200 uses the same Broadcom chipset and supports DFS channels. Once its driver was installed I was able to select the Netgear driver for the Asus device and, sure enough, it works.

Which is a long lead-in to this little tidbit, a key takeaway from Peter’s course this week.

When looking at a wireless capture it’s important to remember you might not be seeing all the information. Physical headers are stripped off by the hardware of the Wi-Fi interface, so you can’t see those. Wi-Fi does certain things with physical headers only, such as MU-MIMO channel sounding, and aspects of this are therefore not visible to a protocol analyser.

It probably goes without saying that you can’t see encrypted data, but you can usually see that it’s been transmitted… Unless your wireless interface can’t decode it.

Let’s say, for example, your AP is using a 40MHz DFS channel and the capture setup you’re using can’t be configured for 40MHz channels. In this scenario, because management and action frames are all transmitted on the primary 20MHz channel, you can see these just fine, yet all the higher rate data frames that take advantage of the full 40MHz just disappear.

The result looks a bit like the picture here… A client sends RTS, the AP responds CTS and the next thing is a block acknowledgement but no data.

It’s sometimes possible to see the data transfer betrayed in the time between the packets but here, because the data rate is high and the traffic is light, it’s not particularly apparent.
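If you want something more concrete than squinting at timestamps, a rough sketch like this (using scapy, with a placeholder filename) tallies frame types in a monitor-mode capture; a pile of RTS/CTS and Block Acks with barely any data frames is the giveaway that the data is going past on a width your rig can’t decode.

```python
"""Tally 802.11 frame types in a monitor-mode capture."""
from collections import Counter
from scapy.all import rdpcap, Dot11

# Control frame subtypes of interest (802.11 type 1)
CONTROL_SUBTYPES = {8: "BlockAckReq", 9: "BlockAck", 11: "RTS", 12: "CTS"}

counts = Counter()
for pkt in rdpcap("capture.pcap"):       # placeholder monitor-mode capture
    if not pkt.haslayer(Dot11):
        continue
    dot11 = pkt[Dot11]
    if dot11.type == 0:                  # management frames
        counts["management"] += 1
    elif dot11.type == 1:                # control frames
        counts[CONTROL_SUBTYPES.get(dot11.subtype, "other control")] += 1
    elif dot11.type == 2:                # data frames
        counts["data"] += 1

for frame_type, count in counts.most_common():
    print(f"{frame_type:15} {count}")
```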

I’m particularly looking forward to spending more time digging into protocol analysis and hopefully getting some better tools.