Here’s a quaint little problem that hit our network recently… We ran out of space on the ARP table of our core routers.
“What the hell?” I hear you say, “what is this, 2003?” and well you might ask.
Over the last couple of years we’ve been upgrading our campus network, which represents a very large number of switches. We started with primarily HP ProCurve 2600 series at the edge and 5400zl doing the OSPF routing. We’re moving to primarily the Comware range, with HPE 5130 at the edge, 5900 doing the OSPF, and 5930s at the core to replace the previous 5900s (set up in a terrible design that meant we could never upgrade them without bringing down everything). It’s a slow process because we try not to break things as we migrate them and, as I mentioned, it’s a lot of switches.
Whilst we have a lot of edge ports, our network isn’t really all that complicated; there’s just a lot of edge. So an IRF pair of 5930s as the core router in each datacentre seemed to be just fine.
Very recently we upgraded our Aruba mobility controllers and moved them off the 5400 switches that have been handling all our WiFi traffic and on to some of the new kit in our datacentres.
So far so good.
We then slowly moved the subnets from the 5400s to our 5930s. This work was well over 50% complete when one morning, just near the start of the university term, we seemed to have a problem.
Some WiFi users seemed to have no connectivity. We quickly established that plenty of traffic was flowing from the WiFi controllers and through our off-site link. Whilst the problem seemed to be fairly serious, it wasn’t affecting hundreds of people, as far as we could tell.
Theories flew round the office as we tried to understand what was happening and why some users seemed to have a perfectly good WiFi connection, with layer 2 traffic passing, but couldn’t do anything useful… There seemed to be no pattern, and a broken user would suddenly start working with nothing having changed.
Someone spotted that we had 16,384 entries in the ARP table of our 5930s. This was initially dismissed as a small number, but one of my brilliant colleagues pointed out that it was a rather neat, round number (exactly 2^14) and that wasn’t likely to be a good thing.
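For anyone wanting to sanity-check a suspicious counter the same way, here is a minimal Python sketch of the "is this a suspiciously round number?" test. The function name is my own invention for illustration; the only figure taken from our incident is 16,384.

```python
def is_power_of_two(n: int) -> bool:
    """True if n is an exact power of two (a likely hard table limit)."""
    # A power of two has a single bit set, so n & (n - 1) clears it to zero.
    return n > 0 and (n & (n - 1)) == 0


arp_entries = 16_384  # the ARP entry count we saw on the 5930s

print(is_power_of_two(arp_entries))   # True
print(arp_entries.bit_length() - 1)   # 14, i.e. 16,384 == 2**14
```

A counter sitting exactly on a power of two is not proof of a full table, but it is a good prompt to go and look up the limit.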
So it turns out that all of the Comware switches we’re using as routers (5510, 5900 and 5930) have a maximum ARP table size of 16,384.
As this term has kicked off we’ve seen higher numbers than that on our WiFi alone, and the 5930s are also routing for all the servers in our datacentres.
This was a pretty basic problem. We’d just filled up the tables and our routing switches were no longer doing the business.
This issue caught us out because, as previously mentioned, 16,384 is quite a small number. The 5930 will handle 288K MAC addresses and a route table of over 100K. More significantly, the decade-old switches we were replacing could handle it.
Another reason this slipped past is that the ARP table size doesn’t appear on the spec sheets of many switches.
We just assumed these very capable datacentre switches had the horsepower and memory allocation to do what we needed, and assuming made an ass of whoever.
Cisco fans will tell you they’ve long had the ability to allocate the finite amount of memory in a layer 3 switch to the tables you need. Fortunately this functionality is now being made available on the 5930 (I don’t know about the other models).
In our case this means we can reduce the routing table size (I don’t think we need 10K, never mind 100K+) and give more room to ARP entries. We can then try again to move the WiFi subnets and, hopefully, avoid problems.
The lesson from this has to be the importance of understanding the spec of a box you’re putting at the centre of a network. I could comment on the importance of assessing the implications of a change, such as moving the routing for 16K clients, but to be honest we’d still have assumed the big switches could handle this.
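The kind of check we should have done before the move can be sketched in a few lines of Python. This is only an illustration: the limits are the figures quoted in this post (the MAC and route numbers are approximate), while the expected counts, the function name and the 80% headroom threshold are all made up for the example.

```python
# Table limits quoted in this post for the 5930 ("288K" and "over 100K" are approximate).
LIMITS = {"arp": 16_384, "mac": 288_000, "routes": 100_000}


def tables_at_risk(expected: dict[str, int], limits: dict[str, int],
                   headroom: float = 0.8) -> list[str]:
    """Return the tables whose expected entry count exceeds `headroom` of the limit."""
    return [name for name, count in expected.items()
            if count > limits[name] * headroom]


# Hypothetical planning figures, for illustration only.
expected = {"arp": 18_000, "mac": 40_000, "routes": 2_000}

print(tables_at_risk(expected, LIMITS))  # ['arp']
```

The hard part, of course, is finding the real limits in the first place when they aren’t on the spec sheet.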
So, roll on the disruption of multiple reboots to bring our 5930s up to the software version that will do what we need, and in the meantime the venerable 5400zl continues to just work.
Finally, I should stress that our network is really very reliable. Serious outages used to be relatively commonplace, and I can’t remember when we last experienced an unexpected widespread outage. That, again, is why this caught us out. The 5930s have just been rock solid and they spoiled us.