Audinate Dante, with lots of switches, and latency

I’m sure someone once said you shouldn’t use commas in a headline or title but they’re not the boss of me and I don’t care. So, to business…. Here’s one of the rare posts that isn’t about the WiFi.

As someone who’s done my fair share of live sound and studio audio work, I’m something of a gear head. One of the big developments in the area of digital audio has been networked audio and by far the biggest, and some would say only truly relevant player, is Australian company Audinate with their product Dante.

For those that don’t know, Dante is a networked audio system that allows multiple unicast or multicast streams of high quality, uncompressed, sample-accurate audio to be routed between devices across a network. It’s based around custom chips that are designed and sold by Audinate then used by pretty much every serious pro audio equipment manufacturer in the world.

Perhaps the best thing about Dante is that it’s just another network device. It doesn’t need anything special to work, and it will work on standard network switches. It’s perhaps no surprise that Dante has become really popular in AV installs. You can even get PoE-powered speakers, so rolling out a PA system through conference system corridors can all be done just with a network cable. It’s neat stuff.

But this post is less about Dante itself, and how awesome it is, and more about how it can impact design decisions in modern networks.

For this example I’m going to use a large lecture hall – a flexible space that can seat 1500 people and is used for conferences, lectures, public events, etc.

A key consideration of a digital audio system is latency – generally referring to how long it takes audio to make its way through a system from, say, a microphone input to the speaker output. The old analogue audio system had effectively no latency. Electrons being inconvenienced along the various mic cables, through the mixer, amplifiers and out to the speakers all happens extremely quickly. In digital systems we have analogue to digital conversion, which takes a bit of time, and we invariably introduce buffers, which hold things up a bit and lead to data waiting around. Almost every part of a digital audio system is slower than analogue. Add all these delays together and the system latency can easily get too long to be acceptable.

The same can be said about networking. Every network device your frame or packet passes through will add a bit of latency.

Dante moves audio across an Ethernet network, and switched Ethernet is really very fast (worth noting that a standard Dante network doesn’t route across subnets). Dante conceptually supports incredibly low latency settings, but most of the chipsets have a minimum latency setting of 1ms. This means anything receiving a stream from that device will allocate a 1ms buffer and will therefore handle network latency of up to 1ms without any disruption to the audio. If the latency exceeds 1ms then you’re in trouble.

A brief blip might be barely noticeable, but sustained high latency will result in late packets being dropped and a particularly unpleasant sounding audio stream.
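To make the buffer idea a bit more concrete, here’s a minimal sketch in Python of how a receiver with a fixed latency setting might sort packets into usable and late. This is not how Dante is actually implemented; the function name and the sample delays are made up purely for illustration, and the only thing taken from above is the 1ms setting.

```python
def classify_packets(delays_us, receiver_latency_us=1000):
    """Split packets into on-time and late, given each packet's network delay.

    Assumption for illustration: a packet is only usable if it arrives
    within the receiver's configured latency window (1ms by default here).
    Anything later has missed its playout slot and is treated as dropped.
    """
    on_time = [d for d in delays_us if d <= receiver_latency_us]
    late = [d for d in delays_us if d > receiver_latency_us]
    return on_time, late


# Hypothetical per-packet network delays in microseconds.
delays = [120, 180, 950, 1300, 200, 2500]
on_time, late = classify_packets(delays)
print(f"{len(on_time)} on time, {len(late)} late: {late}")
```

Raising the receiver latency setting widens the window, which is why it rescues late packets at the cost of more delay through the whole audio system.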

Having discussed latency, and network switches, let’s get to the core of the issue…

Recently installed Dante equipment wasn’t working very well in the lecture hall. The network switches are HPE 5130 with IRF stacking. There are three switch stacks in the building, connected back to an IRF pair of HPE 5900 switches (acting as a router) with 2 x 10Gb links from each stack in an LACP aggregation. Stack1 has two switches, stack2 has five and stack3 has nine.

From an administrative perspective there are three switches in the building, though of course we have to remember there’s really far more… why? because latency, that’s why.

Our problem Dante devices are situated in mobile racks, each containing their own switch – this allows them to be easily unplugged and moved around.

It turns out one mobile rack was connected to stack2 and the other to stack3. With 1ms latency configured, Audinate reckon Dante is good for ten gigabit switches or about three 100Mb switches (because Fast Ethernet has higher latency), but we were having problems, so just how many switches was our traffic passing through?

The trouble is, I don’t know. Because our Comware switches are in IRF stacks, how traffic moves between members is inside the black box and not available for analysis. Our stacks are a ring topology and we have two uplinks, but I don’t know which way around the ring the traffic is going or which of the two uplinks it’s using.

So I have to assume a worst case scenario, even if this shouldn’t actually happen. Here goes: My first mobile rack switch is connected to stack2 member 1 and that stack hashes my traffic to the second uplink which is on member 9. For some reason, rather than taking the shortest route, my traffic makes its way the long way around the ring, therefore passing through a total of five switches.

It then heads to the routing switch, also an IRF pair, so let’s assume we pass through both of them and on towards stack3. Again we have to assume the worst case scenario, that the traffic passes through all nine switches.

The end result is that it’s possible the traffic could traverse 18 switches.

It’s hard to put firm figures on this. According to the HPE 5130 datasheet the gigabit latency is <5us. What I don’t know is whether that changes as more complex configs are applied. 10Gb latency is an even lower <3us. However we don’t know how, or even if, the IRF affects this. These switches are blisteringly, mind-bendingly fast. Even a much cheaper switch is really, really quick, and it’s why we can build switched networks capable of incredible performance. All hail the inventor of the network switch, whose name I can’t be bothered to look up.

Even that blisteringly fast performance x 18 still adds up, but not to much. If our Dante traffic follows the worst case scenario, however unlikely, we still only get to something like 200us of latency. This should be absolutely fine, but it wasn’t reliable. I suspect a big part of this is general network conditions. The more physical switches your Dante traffic passes through, the higher the chances it could face moments of congestion.
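For anyone who wants to poke at the numbers, here’s a rough back-of-envelope sketch in Python. The breakdown of the 18 switches is my reading of the path described above (a mobile rack switch at each end, five in stack2, the IRF routing pair, nine in stack3). It assumes every switch, including the mobile rack ones, sits at the 5us gigabit datasheet ceiling, and it only counts forwarding latency. Serialisation, queuing and whatever IRF adds are ignored, which is why it lands under my rough 200us figure.

```python
# Switch counts for the worst-case path (an interpretation of the text above).
path = {
    "mobile rack switch (source)": 1,
    "stack2 members": 5,
    "IRF routing pair": 2,
    "stack3 members": 9,
    "mobile rack switch (destination)": 1,
}

PER_SWITCH_LATENCY_US = 5        # datasheet upper bound for gigabit ports
DANTE_LATENCY_SETTING_US = 1000  # the 1ms receiver setting

hops = sum(path.values())
forwarding_latency_us = hops * PER_SWITCH_LATENCY_US
headroom_us = DANTE_LATENCY_SETTING_US - forwarding_latency_us

print(f"switches in path: {hops}")
print(f"datasheet forwarding latency: {forwarding_latency_us}us")
print(f"headroom against the 1ms setting: {headroom_us}us")
```

Even if you double or triple the per-switch figure to allow for the unknowns, the total stays well inside 1ms, which is what makes the unreliability so puzzling.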

Back to reality then. Increasing the latency setting of the Dante device would remedy this at the cost of increasing the total audio system latency. If that’s acceptable, it’s the best choice because it provides a healthy margin. The other alternative is to change the network patching so we limit the number of switches. In this case we’ve been able to keep the critical devices connected to stack3, so the very worst case scenario is down to 11 switches without touching the uplinks. In fact, most Dante devices are linked to the same switch stack member.

At the end of this long, confusing mess of writing then, I’m left uncertain quite what was causing the issue. Our theoretical latency was well below the 1ms limit and yet we ran into issues. Probably the most important takeaway for a network engineer is that when working with something as latency sensitive and uncompromising as Dante, it’s important to consider the number of devices in the network path… you know, just like the deployment guide says.

Aruba Control Plane Security and the AP-203H

Here’s a useful little tidbit for anyone with an Aruba OS 6.5 environment wanting to enable control plane security with some AP-203H on the network.

With Aruba campus APs all traffic between users and the controller is encrypted (probably) by virtue of the Wi-Fi encryption in use being between the client and the mobility controller, rather than being decrypted by the AP. So unless you’re using an open network all your client traffic is encrypted until it pops out of the controller. Lovely.

Control plane traffic, the AP talking to the controller, is not. This isn’t usually a problem as we mostly trust the wires (whether we should or not is another matter).

In many Aruba controller environments all the user traffic is tunneled back to the controller in our data centre, which works very well especially when users are highly mobile around a campus.

However it’s also possible to bridge an SSID so the AP drops the traffic on a local vlan. This is desirable in some circumstances, one real world example being a robotics research lab needing to connect devices to a local subnet. In order to enable this you first have to switch on control plane security.

Switching CPsec on causes all APs to reboot at least twice, so on an existing deployment it leads to 8-15 minutes of downtime. I switched CPsec on this morning for our network with approximately 2500 APs. It was a tense time but went well…. mostly.

I’d read stories of APs with a bad trusted platform module being unable to share their certificate with the controller. We have a lot of APs from many years old to brand new and even a low percentage failure rate would present a problem.

In the end two APs failed, with lots of TPM initialization errors. However all our AP-203H units failed to come back up and the controller process logs started showing things like this:

Sep 4 07:57:37 nanny[1399]: <303022> <WARN> |AP <apname>@<ip_addr> nanny| Reboot Reason: AP rebooted Wed Dec 31 16:01:41 PST 1969; Unable to set up IPSec tunnel to saved lms, Error:RC_ERROR_CPSEC_CERT_REJECTED

It took a little while to become clear that just one model of AP was affected, and that it was every unit of that model.
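With 2500 APs rebooting, picking the affected ones out of the logs by eye is slow. Here’s a minimal Python sketch of the sort of thing that would have sped it up: pull the AP name out of each nanny line and collect the ones reporting the certificate rejection. The regex is based only on the single log line above, so treat the format as an assumption, and the AP names and addresses in the sample are invented stand-ins for the placeholders.

```python
import re

# Pattern inferred from the one log line quoted above; other ArubaOS
# messages may well be laid out differently.
LOG_PATTERN = re.compile(
    r"\|AP (?P<ap>[^@|]+)@(?P<ip>\S+) nanny\|.*Error:(?P<error>\w+)"
)


def rejected_aps(lines):
    """Return sorted AP names whose reboot reason is a CPsec cert rejection."""
    aps = set()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match and match.group("error") == "RC_ERROR_CPSEC_CERT_REJECTED":
            aps.add(match.group("ap"))
    return sorted(aps)


# Hypothetical AP names and IPs substituted into the quoted log line.
template = (
    "Sep 4 07:57:37 nanny[1399]: <303022> <WARN> |AP {ap}@{ip} nanny| "
    "Reboot Reason: AP rebooted Wed Dec 31 16:01:41 PST 1969; Unable to set "
    "up IPSec tunnel to saved lms, Error:RC_ERROR_CPSEC_CERT_REJECTED"
)
sample = [
    template.format(ap="lab-203h-01", ip="10.0.0.11"),
    template.format(ap="lab-203h-02", ip="10.0.0.12"),
    "a line that does not match the pattern at all",
]

print(rejected_aps(sample))
```

Cross-referencing the resulting names against an AP inventory is what would show that every hit is the same model.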

A bit of time with Aruba TAC later and it transpires it’s necessary to execute the command: crypto-local pki allow-low-assurance-devices 

This needs to be run on each controller separately, and saved of course.

The command is documented as allowing non-TPM devices to connect to the controller. I’m not entirely clear what’s special about the AP-203H; presumably it doesn’t have a TPM, yet it does have a factory installed certificate. We also have some AP-103H units, which don’t have a TPM but work just fine with a switch-cert. I suspect this is a bug in the firmware, and Aruba OS is treating the 203H as if it has a TPM, but as it doesn’t, everything falls apart.

If you want the maximum possible security, allowing low assurance devices is going to raise eyebrows. In our case we’re happy with this trade-off.

May this post find someone else who one day searches for RC_ERROR_CPSEC_CERT_REJECTED ap-203h