Getting DHCP right

The second in a short series on the wider network services we need to get right in order to offer a good user experience to our Wi-Fi clients – first I mused about DNS; this time it’s Dynamic Host Configuration Protocol, or DHCP, for IPv4.

Put simply, DHCP is what assigns your device an IP address when it joins a network. I’m not going into detail on how to configure it; the focus here is what I’ve seen go wrong in the real world.

Ensure your IP address space and DHCP scope are large enough for the intended number of clients. For example, a coffee shop with a peak of 20 clients would be just fine using a /24 subnet, which allows for a total of 253 clients (after accounting for the router address), whereas a 17,000-seat stadium would need a substantially larger subnet. Don’t short-change yourself here: make sure there’s plenty of room for growth.

Pool exhaustion due to long lease duration. When the DHCP server runs out of IP addresses to hand out, that’s known as pool exhaustion. Consider the coffee shop with an ISP-provided router which offers an address in a /24 subnet. That’s fine for the first 20 customers of the day, and the next 20, and so on, but a busy shop could soon have a lot of people come through, and if enough of them hop onto the network the DHCP pool could run out – especially if the pool isn’t the full 253 addresses but maybe only 100. The simple fix is to set a lower DHCP lease time; 1 hour would likely be sufficient, but beware that short lease times can increase server load in some circumstances.
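By way of illustration, here’s a minimal sketch of both ideas in ISC dhcpd syntax – the subnet, range and lease times are made-up examples, not a recommendation for any particular network:

# A /23 gives roughly 500 usable addresses - plenty of headroom for a busy venue
subnet 10.20.0.0 netmask 255.255.254.0 {
  option routers 10.20.0.1;
  range 10.20.0.10 10.20.1.250;
  default-lease-time 3600;   # 1 hour, so addresses from departed clients recycle quickly
  max-lease-time 7200;
}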

The client needs to be able to reach the DHCP server. A common captive portal deployment moves the user into a different role after authentication. I have encountered networks where the authenticated role blocked all RFC-1918 addresses as a catch-all to prevent access to internal services; however, this also prevented clients from renewing an IP address. Much unpredictability ensued. The solution was simply to allow DHCP traffic to reach the DHCP servers.
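What that looks like in practice depends entirely on how your vendor expresses roles and policies, but the ordering is the generic part. As a hedged illustration in plain iptables terms (the DHCP server addresses are invented), the DHCP allow just needs to sit above the blanket private-range deny:

# Permit DHCP renewals to the DHCP servers first...
iptables -A FORWARD -p udp --dport 67 -d 10.0.0.5 -j ACCEPT
iptables -A FORWARD -p udp --dport 67 -d 10.0.0.6 -j ACCEPT
# ...then apply the RFC-1918 catch-all for everything else
iptables -A FORWARD -d 10.0.0.0/8 -j DROP
iptables -A FORWARD -d 172.16.0.0/12 -j DROP
iptables -A FORWARD -d 192.168.0.0/16 -j DROP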

DHCP server hardware capacity. DHCP can get really complicated and tied into expensive IPAM products such as Infoblox. For most deployments this isn’t necessary and the hardware requirements are usually not significant enough to be a concern. However, this can be context dependent. A busy network with lots of people constantly coming and going likely has a fairly low peak DHCP request rate because arrivals are spread over time. A stadium network where lots of people arrive very quickly may see peak demand that requires a little more horsepower – as with DNS, keep an eye on the server load to understand if hardware limits are being reached. In practice very modest hardware can meet the DHCP demand of many thousands of users.

Multiple server synchronization is where more than one server shares the same pool – best practice in larger deployments for redundancy, but it’s something I have seen go wrong, with the result that the same IP address is offered to more than one client. Fixing this gets too far into the weeds and will be implementation specific; it’s enough to know that it absolutely shouldn’t happen, and if the logs suggest it is happening, that’s a serious problem that needs someone to fix it.

The DHCP server simply stops working. Yep, this can and does happen. It’s especially a problem in some of the more affordable hardware solutions such as ISP-provided routers. I encountered a MikroTik router being used for DHCP on a large public network, and from time to time it would just stop issuing IP addresses to random clients before eventually issuing no leases at all. A reboot always resolved it, and I’m sure newer firmware has fixed this. There was often a battle with the device’s owner to get them to restart it because “it was routing traffic just fine” and, yes, it was. It just wasn’t issuing IP address leases any more.

Why is it always DNS?

When we’re working on the Wi-Fi it can be easy to overlook some of the basic network services that are essential to client connectivity. These might be managed by someone else or just not your bag if you’ve spent all your time learning the finer points of 802.11. So here’s the first of a few short pieces looking at these elements of building a good Wi-Fi network, focussing this time on Domain Name System or DNS.

Put simply DNS is the thing that resolves a human friendly wordy name such as wifizoo.org into an IP address such as 178.79.163.251 (IPv4) and 2a01:7e00::f03c:91ff:fe92:c52b (IPv6).

When DNS doesn’t work the whole network can appear to be down. No DNS means your browser can’t resolve google.com to an IP address so can’t send any requests. Many network clients check that a test web server can be reached to confirm internet connectivity – if its name can’t be resolved the client will report no internet… Poor DNS performance can make the fastest network appear slow to the end user.

What I’m stressing is DNS needs to work and it needs to be responsive and reliable.

What I’ve seen go wrong

Right-size the server – Too many clients hitting a server with insufficient resources is a bad time, and one you will likely only see at peak usage. A coffee shop with 20 clients is fine with the ISP router as the local DNS server. A stadium with 10,000 clients needs a caching DNS server capable of handling the peak requests per second. Know what your DNS server is and what else that box might be doing, and watch system resource utilization (CPU, memory, etc.) at peak times to understand whether the hardware is reaching capacity.

Have more than one DNS server. Like right-sizing, this depends on context. Again, a coffee shop will have a single DNS server in the form of the ISP router, which is a single point of failure no matter what. A larger network with redundant switching and routing should have at least two DNS servers issued to clients, and these should be in separate locations – you’re aiming to ensure DNS is available in the event of a failure somewhere. I have encountered a situation where two DNS servers were VMs running in the same data centre, which lost power. Someone forgot to pin the servers to specific hosts.
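If clients get their resolvers from DHCP, handing out two servers in separate locations is a one-line change – for example in ISC dhcpd (addresses are illustrative):

option domain-name-servers 10.1.0.53, 10.2.0.53;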

Public DNS server rate limiting – The network became slow and unreliable at peak times, but airtime utilization was not the problem. Say you decide to use a public DNS service such as Google’s 8.8.8.8 or Cloudflare’s 1.1.1.1 on a public Wi-Fi network that sends all outbound traffic from a single NAT’ed public IP address. You run the risk of the DNS being rate limited. I’ve seen this happen, and there is very little to no documentation about the thresholds. Use either an internal server or a paid DNS service for public networks, which can also bring the benefit of simple filtering by category (adult, gambling, etc.) and blocking of known malware domains.

Monitor DNS on public Wi-Fi. Use something like HPE Aruba UXI or NetBeez that sits as a client on the network and runs regular tests. This provides visibility into problems like high DNS latency or outright failures, and logs them against a timestamp that will help diagnose issues related to overloaded or rate-limited DNS.
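If a commercial sensor isn’t in the budget, even a crude homegrown probe running on a spare client gives you that timestamped history. A rough sketch in Python – the test names and the 200 ms threshold are arbitrary choices, and it simply uses whatever resolver the system has been handed:

# Crude DNS probe: resolve a few names every minute and log slow or failed lookups
import socket, time, datetime

NAMES = ["example.com", "bbc.co.uk", "wifizoo.org"]   # example targets
THRESHOLD = 0.2                                       # seconds

while True:
    for name in NAMES:
        start = time.monotonic()
        try:
            socket.getaddrinfo(name, 443)
            elapsed = time.monotonic() - start
            status = "SLOW" if elapsed > THRESHOLD else "OK"
        except socket.gaierror:
            elapsed = time.monotonic() - start
            status = "FAIL"
        print(f"{datetime.datetime.now().isoformat()} {name} {status} {elapsed:.3f}s")
    time.sleep(60)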

Upstream servers are rubbish. Lots of complaints about a “slow” network, but throughput and latency were always fine. The issue was a poorly performing upstream DNS server which took long enough to resolve anything not in the local server’s cache that requests would often time out. For internal DNS servers, consider what is being used upstream. If your high-performing local DNS server is forwarding requests it can’t answer to a poorly performing ISP DNS server, you’ll still have a bad time.

My personal recommendation for any public Wi-Fi solution is to use a local caching DNS server. Unbound DNS is a good option. It’s easily deployed on Linux and is built into the OPNsense open source firewall/router, which is an easy way to deploy it if you just want an appliance. I will keep coming back to OPNsense for other elements of this series as it’s often a great solution. The default OPNsense configuration of Unbound will use the DNS root servers to locate authoritative answers to queries. You can also forward requests to specific upstream servers.
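For anyone running Unbound outside of OPNsense, a minimal unbound.conf along the same lines might look something like this – the interface and client range are placeholders, and the commented-out block shows how you’d forward to a chosen upstream instead of resolving from the roots:

server:
    interface: 0.0.0.0
    access-control: 10.0.0.0/8 allow    # adjust to the client ranges you serve

# Uncomment to forward queries to a specific upstream rather than using the root servers:
# forward-zone:
#     name: "."
#     forward-addr: 192.0.2.53          # placeholder upstream resolver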

It’s key to understand the client experience. There can be a temptation to see hundreds or thousands of clients on the network, and plenty of data moving, as justification to play down user complaints of poor performance – however it might just be that DNS is letting down your excellent RF design.

ClearPass Guest DoB Regex

I don’t like the ClearPass date/time picker because it picks a time whether you want one or not, which is confusing, and if you’re as old as I am, having to click back to find the correct year is tiresome.

Here’s a handy regex validator for UK format date of birth. This is an amalgamation and tweak of several expressions I found when setting up a captive portal for a client.

This format assumes day month year, as commonly used in the UK. It will validate against two- or four-digit years and allow either / or - as the separator (dd/mm/yy or dd-mm-yy, for example).

The aim was to be as flexible as possible in the format used.

Clearly it will also validate if someone puts in a date of 4th September 1984 as 09/04/84, so we’re not being super precise about things, but this was deemed good enough and it’s less likely a UK user would use the wrong format.

Although the expression is less prescriptive than this, a validation error message was written for clarity so the user is given an example that will validate.

/^(3[01]|[12][0-9]|0?[1-9])(\/|-)(1[0-2]|0?[1-9])\2([0-9]{2})?[0-9]{2}$/
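To sanity-check the expression, here’s a quick Python snippet running it against a handful of example inputs:

import re

# Same expression as above, minus the surrounding slashes
DOB = re.compile(r"^(3[01]|[12][0-9]|0?[1-9])(/|-)(1[0-2]|0?[1-9])\2([0-9]{2})?[0-9]{2}$")

for value in ["04/09/1984", "4-9-84", "31/12/2001", "32/01/1999", "04/09-1984"]:
    print(value, "valid" if DOB.match(value) else "invalid")

# The first three validate; the last two don't (day out of range, mismatched separators).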

How to do Wi-Fi in your home

tl;dr: stick an AP in your loft.

I have lived in three fairly different houses through my adult life: a very small 1930s terrace, a 2018 new-build semi and a larger 1970s detached property.

In all of these I had the same problem. Placing a Wi-Fi router at the location of the incoming internet connection resulted in compromised coverage.

When creating an enterprise Wi-Fi design, using the fabric of the structure to block signals is often key to channel re-use. In a domestic situation this is much more challenging because you probably haven’t got lots of structured Cat6 cabling.

The simple reality is that walls in many houses block Wi-Fi to a greater or lesser extent. My tiny terrace house had all solid walls with 25 dB of attenuation. The new build was timber frame with all internal drywall, but to comply with UK fire regs this was lined with foil which, again, did a number on RF propagation. The 1970s house has practically RF-transparent drywall upstairs and extremely RF-obstructive blockwork downstairs.

To get around this common issue ISPs and various manufacturers have come up with complicated Wi-Fi mesh products. These are better than the O.G. Wi-Fi extenders of old, but in practice not much.

I’ve spent a LOT of time trying to help a friend position his BT Wi-Fi discs just so around the house, so that they have a good connection back to the router (possibly via each other), can be plugged in, and are not in a stupid place – not easy.

So what’s the answer? Much as I found when working on a 16th-century building that was all thick stone walls and wooden floors, in many situations floors and ceilings offer much less of an obstacle to RF than walls do.

Treat these houses as a set of individual floors and each floor needs two to four APs to provide coverage – fairly ridiculous for the square metres we’re talking about. However, RF travels in three dimensions, not two. Placing one AP in the loft/attic provides great coverage across the whole house.

When the 1970s house was refurbished, Cat6a was installed throughout (because I’m a network engineer, that’s why), including into the loft. Yet again, despite planning for additional APs, I’ve found a single AP at the top of the house provides great coverage and performance.

So if you’re struggling to get Wi-Fi coverage across your house, before you start running cables to rooms, deploying APs over powerline all over the place, or breaking out the mesh, try putting the router in the attic.

Practically, a good use of the mesh approach would be to stick one or two mesh APs in the attic, ensuring they can both communicate with the router.

good luck 🙂

SD-Branch meshing slowness

Aruba SD-Branch supports branch meshing which, as the name suggests, allows branches to build IPsec tunnels directly between one another and share routes. This is useful if you have server resources within a branch that need to be accessed from other sites. The concept is that it’s more efficient for traffic to flow directly between sites rather than via the VPNC in the company data centre or cloud service.

Whilst this all makes complete sense, it’s worth considering that not all ISPs are equal – of course we know this – and not all ISP peering is quite what we might expect.

I have recently worked on a project where branch mesh is occasionally used and the customer experienced significant performance problems with site B accessing servers on site A when the mesh was enabled.

The issue was down to ISP peering. Site A is in country1, Site B is in country2 and the VPNC is in country3. Traffic from the ISPs at both sites to the VPNC was as fast as it could be. Both ISPs generally performed extremely well, but as soon as traffic was routed between them the path taken was weird, with very high latency.

Because the ISPs on both sites were performing well in all other respects, reachability and performance tests all looked good. The gateways therefore happily used the branch mesh for the traffic between the two sites and the user experience was horrible.

The short-term fix was to disable mesh between these branches. The long-term fix was to change ISP at one of the sites. The customer did try raising a case with both ISPs. One engaged and at least tried to do something, the other didn’t… guess which was replaced.

Copper POTS is dead, long live fibre!

The Plain Old Telephone Service that’s been around in the UK since the late 19th century is about to be discontinued. This isn’t happening everywhere all at once – it’s phased on an exchange-by-exchange basis – but ultimately if you currently have a basic phone line it is going to stop working.

There are concerns around this, mostly centred on elderly people who still use a landline, and what happens when the power goes out. The argument goes that in an emergency, e.g. during a storm or in a power cut, we would not want to leave people without the ability to call for help.

There are other issues around the loss of analogue lines to do with monitoring and alarm systems, but these can pretty much all be mitigated with VoIP adapters. The real issue is about reliable service in an emergency.

It’s hard to know how much of a problem this actually is. Whilst I don’t doubt there are plenty of elderly people for whom the landline is important, my first question is how many have a cordless phone? These have a base station that needs mains power – I have never seen a domestic unit with battery backup – so in a power cut they do not work, though a secondary wired phone may be present.

Then there’s the question about line damage. The circumstances that lead to power outages can often also result in damage to telephone lines. It doesn’t matter that the exchange is still working if you’re no longer connected to it.

If your analogue landline is discontinued it will be replaced with either FTTP or a DSL circuit over the copper pair. Maintaining a phone service then means using VoIP – either dedicated phone hardware or some sort of VoIP adapter, possibly built into the router.

The most obvious solution is a low-cost UPS. To maintain service in an emergency, one to three devices would need to be powered. These are not going to present a very high current draw, and a small UPS would probably keep things working for several hours – albeit with annoying beeping, which is itself likely to be an issue.
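As a rough worked example (numbers purely illustrative): an ONT, a router and a cordless phone base station might draw 15–20 W between them, and a small UPS with a 12 V 7 Ah battery holds around 84 Wh, so even after inverter losses you’d expect something in the region of three to four hours of runtime.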

Some of the debate I’ve seen is around who is responsible for this. I understand why this is raised, because currently a basic analogue telephone is powered by the exchange. The thing is, it’s possible to consider an analogue telephone as part of the exchange: when in use it completes the circuit through the line. As this service is withdrawn it becomes necessary to power devices locally – as it already is with a cordless phone connected to an analogue line.

UK domestic comms has moved from circuit-switched analogue voice telephony to broadband packet-switched IP. The way analogue telephones work, at least between the exchange and the home, is essentially the same now as it was in the late 1800s, and it makes complete sense to end this service now that it is no longer the primary means of communication.

Something that doesn’t get a lot of mention in the press coverage around this is that parts of the phone network are getting old. Nothing lasts forever, and periodically the cabling infrastructure in the ground and the equipment in exchanges needs to be replaced. We have gone from the human operator through several generations of mechanical switching (this link provides a wonderful explainer on how an old mechanical telephone exchange works) to electronic analogue switching, then to everything being digital links, and now it’s all IP between exchanges anyway. Having to maintain the old just because it’s been there a long time does not make financial sense.

Given what the communications network is now used for, it makes no sense to put new copper wiring in. If streets and driveways are to be dug up to replace old cabling, it’s much better to replace it with fibre – and that’s what’s happening.

The problem here is one of change and how that change is managed. BT have gone from offering to provide battery-backed equipment, to not doing that, to postponing the copper-pair migration in some areas because it turns out they hadn’t worked this out.

I have seen a lot of people claiming that the good old analogue phone is simple and reliable and should be maintained for that reason. I can see the logic of that, though I would argue it’s only the user end that’s simple.

Perhaps there’s a market for a simple VoIP phone terminal that has built-in DSL and Ethernet, doesn’t need a separate router and can supply power (a 12 V outlet) for the PON terminal – nice big Li-Po battery inside. Get that right and BT can probably then have a standard hardware item to issue that will just work, and the rollout can proceed.

Wi-Fi7 – rainbows, unicorns and high performance

Like every technological advancement that leads directly to sales of new hardware, Wi-Fi7 promises to solve all your problems. This, of course, will not happen; nevertheless it’s now available, in draft form, so should you buy it?

No.

Ok let me qualify that. Not at the moment unless you are buying new Wi-Fi hardware anyway, in which case maybe.

The IEEE specification behind Wi-Fi7 is 802.11be and it isn’t finalised yet. That means any Wi-Fi7 kit you can buy right now is an implementation of the draft specification. Chances are that specification isn’t going to change much between now and when it’s finalised (expected end of 2024), but it could. There’s nothing new here; vendors have released hardware based on draft specs for the last few major revisions of the 802.11 Wi-Fi standards.

Perhaps more important is that, in the rush to get new hardware on the shelves, what you can buy now is Wi-Fi7 wave 1, which doesn’t include some capabilities within the specification. As we saw with 802.11ac (Wi-Fi5), the wave 2 hardware can be expected to be quite a lot better – it will support more of the protocol’s options, and chances are the hardware will be more power efficient too – personally I’d wait.

Something that’s important to remember about every iteration of Wi-Fi is that it almost certainly won’t magically solve whatever problem you have that you believe is caused by the Wi-Fi. Client support is also very sparse right now, so swapping out your Wi-Fi6 access point/Wi-Fi router for Wi-Fi7 hardware probably won’t make any difference at all.

As with all previous versions many of the benefits Wi-Fi7 brings are iterative improvements that aim to improve airtime usage. These are definitely worth having, but they’re not going to make the huge difference marketing might have you believe.

The possibility of 4096-QAM (subject to really high SNR) allows for higher data rates – all other things being equal. 512 MPDU compressed block-ack is a complex-sounding thing that ultimately means sending a bigger chunk of data at a time and being able to move more data before acknowledging – which is more efficient. Channel bonding is enhanced, with 320 MHz channels now possible and improvements in how a channel within the bonded range that’s in use by something else is handled. All very welcome (apart from maybe 320 MHz channels) and all iterations on Wi-Fi6.
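To put the QAM change in numbers: 4096-QAM carries 12 bits per symbol against the 10 bits of Wi-Fi6’s 1024-QAM, so the peak PHY rate goes up by roughly 20% – and only when the SNR is high enough to actually use it.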

The biggest headline boost to performance in Wi-Fi7 is Multi-Link Operation – MLO. For anyone familiar with link aggregation – what Cisco calls a Port Channel – the idea of taping together a number of links to aggregate bandwidth across them as a single logical connection: MLO is basically this for Wi-Fi radios.

That 2.4GHz band that’s been referred to as dead for the last however many years can now be duct-taped to 5GHz channels and you get extra bandwidth for your connection. You might expect you could also simultaneously use 5GHz and 6GHz, and you can… in theory, but none of the vendors offering Wi-Fi7 hardware support that right now. Chances are this is something that will come in the wave 2 hardware, maybe a software update… who knows.

There are benefits to MLO other than raw throughput – a device with two radios (2×2:2) could listen on both 5GHz and 6GHz (for example) and then use whichever channel is free to send its transmission. This can improve airtime usage on busy networks and reduce latency for the client. Devices switching bands within the same AP can also do so without needing to roam (currently moving from 2.4GHz to 5GHz is a roaming event that requires authentication), and this improves reliability.

Key to MLO is having Multi-Link Devices. Your client needs to support this for any of the above to work.

Wi-Fi7 has a lot to offer; it builds on Wi-Fi6 while introducing technology that paves the way for further significant improvements when Wi-Fi8 arrives. There’s a lot of potential for Wi-Fi networks to get a lot faster with a Wi-Fi7 deployment.

Returning to my initial question… Personally I wouldn’t buy Wi-Fi7 hardware today unless I already needed to replace my equipment. Even then, domestically I’d probably get something to fill the gap until the wave 2 hardware arrives. If everything is working just fine but you’d like to get that bit more wireless speed, chances are Wi-Fi7 isn’t going to deliver quite as you might hope. Those super speeds need the client to be very close to the AP.

ClearPass extension restart behaviour

To address occasional issues with a misbehaving extension entering a restart loop, some changes were made in ClearPass 6.11. These can result in an extension stopping when it isn’t expected to and, crucially, not restarting again.

A restartPolicy option has been added which gives you control over whether an extension restarts when the server or the extensions service restarts. A good practice is to add “restartPolicy” : “unless-stopped” to your extension configuration – note that I have only used this with the Intune extension. Below are the options available.

  • “restartPolicy”: “no” – The Extension will not be automatically restarted after the server is restarted.
  • “restartPolicy”: “always” – The Extension will always be restarted after the server is restarted.
  • “restartPolicy”: “unless-stopped” – The Extension will be restarted unless it was stopped prior to the server restart, in which case it will maintain that state.
  • “restartPolicy”: “on-failure:N” – If the Extension fails to restart, the value for “N” specifies the number of times the Extension should try to restart. If you do not provide a value for “N”, the default value will be “0”.

Whilst the default behaviour ought to effectively match the “unless-stopped” policy, in my experience there can be issues with extensions stopping unexpectedly. Prior to release 6.11.5 a bug meant an unrelated service would restart the extensions service, and this resulted in stopped extensions. Whilst this should now be resolved, I have still run into the problem. Adding “restartPolicy” : “unless-stopped” resolved this issue.
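For context, the setting just sits alongside whatever is already in the extension’s configuration JSON – a sketch only, with a placeholder key standing in for the extension’s real settings:

{
    "restartPolicy": "unless-stopped",
    "someExistingSetting": "left exactly as it was"
}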

ClearPass Guest pages in a specific language

If you’ve made use of language packs in ClearPass Guest, you’ll know that it’s possible to support multiple languages across Guest in both customer-facing pages and the back end. Everything will use whichever language you have set as the default, and you can then give the user the option of choosing an alternative. There’s also the option of enabling language detection, where ClearPass will hopefully match the language used to the user’s system settings – this can be found in the Language Assistant within ClearPass Guest.

This works very well and is going to meet most requirements, but there are some edge cases where it may be desirable to have some guest pages that open in a language different from the back-end default.

Take the example of regional languages that are especially important to a subset of users but might not have wide operating system support. ClearPass Guest offers language customisation, allowing use of a language that isn’t built in to the product. In such a case it might be a requirement to use the regional language as the default for a captive portal, but the administrators of the system may not speak it – it’s also worth noting that if translations don’t exist for a selected language, ClearPass can exhibit some buggy behaviour, with elements of the back-end UI no longer working (as of release 6.11.5).

Whilst there may be alternative methods, one option is using the language selector to redirect users to the appropriate page and language.

Under the hood, what the language selector drop-down actually does is request translation_lang.php with parameters for the destination page (usually the page you’re already on) and the language. You can use this as your captive portal redirect to land users directly on the language of choice.

As an example, if you want the self-registration login page in Klingon it’s something like this:
“https://CPPM/guest/translation_lang.php?target=guest_register_login.php&lang=tlh”

Adjust to match your server address, the desired page and language pack.