Getting DHCP right

The second in a short series on the wider network services we need to get right in order to offer a good user experience to our Wi-Fi clients – first I mused about DNS; this time it’s Dynamic Host Configuration Protocol, or DHCP, for IPv4.

Put simply, DHCP is what assigns your device an IP address when it joins a network. I’m not going into detail on how to configure it; the focus here is what I’ve seen go wrong in the real world.

Ensure your IP address space and DHCP scope are large enough for the intended number of clients. For example, a coffee shop with a peak of 20 clients would be just fine using a /24 subnet, which allows for a total of 253 clients (after accounting for the router address), whereas a 17,000-seat stadium would need a substantially larger subnet. Don’t short-change yourself here; make sure there’s plenty of room for growth.
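As a quick sanity check on scope sizing, Python’s standard ipaddress module can tell you how many client addresses a prefix really provides. A minimal sketch – the prefixes below are illustrative examples, and it assumes only one address is reserved for the router:

```python
import ipaddress

def usable_clients(prefix: str, reserved: int = 1) -> int:
    """Usable client addresses in an IPv4 subnet, after subtracting the
    network and broadcast addresses plus any infrastructure reservations."""
    net = ipaddress.ip_network(prefix)
    # num_addresses counts every address, including network and broadcast
    return net.num_addresses - 2 - reserved

print(usable_clients("192.168.1.0/24"))  # 253 – plenty for a coffee shop
print(usable_clients("10.20.0.0/18"))    # 16381 – stadium-sized
```

If you reserve more addresses for static infrastructure, bump the `reserved` argument accordingly.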

Pool exhaustion due to long lease duration. When the DHCP server runs out of IP addresses to hand out, that’s known as pool exhaustion. Consider the coffee shop with an ISP-provided router offering addresses from a /24 subnet. That’s fine for the first 20 customers of the day, and the next 20, and so on. But a busy shop could soon have a lot of people come through, and if enough of them hop onto the network the DHCP pool could run out – especially if the pool isn’t the full 253 addresses but maybe only 100, and each lease is held for the router’s default duration (often 24 hours) long after the customer has left. The simple fix is to set a lower DHCP lease time; 1 hour would likely be sufficient, but beware that short lease times can increase server load in some circumstances.
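A back-of-envelope way to see the lease-time effect is Little’s law: steady-state addresses in use ≈ client arrival rate × lease duration, because each new device holds an address for the full lease whether or not it stays. A minimal sketch, with made-up illustrative numbers:

```python
def leases_in_use(arrivals_per_hour: float, lease_hours: float) -> float:
    """Rough steady-state count of leases held (Little's law): every new
    client ties up an address for the whole lease duration."""
    return arrivals_per_hour * lease_hours

# Coffee shop seeing 30 new devices per hour, against a 100-address pool:
print(leases_in_use(30, 24))  # 720.0 – a 24h lease exhausts the pool fast
print(leases_in_use(30, 1))   # 30.0 – a 1h lease leaves ample headroom
```

The numbers are illustrative, but the shape of the result is why shortening the lease is usually the fix.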

The client needs to be able to reach the DHCP server. A common captive portal deployment moves the user into a different role after authentication. I have encountered networks where the authenticated role blocked all RFC 1918 addresses as a catch-all to prevent access to internal services; however, this also prevented clients from renewing their IP addresses, since renewals are unicast directly to the server rather than broadcast. Much unpredictability ensued. The solution was simply to allow DHCP traffic through to the DHCP servers.

DHCP server hardware capacity. DHCP can get really complicated and tied into expensive IPAM products such as Infoblox. For most deployments this isn’t necessary, and the hardware requirements are usually not significant enough to be a concern. However, this can be context-dependent. A busy network with people constantly coming and going spreads its requests out and likely has a fairly low peak DHCP request rate, whereas a stadium network where lots of people arrive very quickly may see peak demand that requires a little more horsepower – as with DNS, keep an eye on the server load to understand if hardware limits are being reached. In practice, very modest hardware can meet the DHCP demand of many thousands of users.
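To put rough numbers on the stadium case: a full lease negotiation is four packets (Discover, Offer, Request, Ack), so even a rapid arrival of clients produces a surprisingly modest packet rate. A sketch with assumed figures – the arrival window is a guess for illustration:

```python
def peak_dhcp_pps(clients: int, arrival_window_min: float,
                  packets_per_client: int = 4) -> float:
    """Rough peak DHCP packets/sec if clients arrive evenly over a window.
    A full DORA exchange (Discover, Offer, Request, Ack) is four packets."""
    return clients * packets_per_client / (arrival_window_min * 60)

# 17,000 devices all joining over a 30-minute pre-match rush:
print(round(peak_dhcp_pps(17_000, 30), 1))  # 37.8 packets/sec
```

Tens of packets per second is trivial for almost any server, which is why capacity problems here tend to come from software bugs or misconfiguration rather than raw hardware.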

Multiple-server synchronization is where more than one server shares the same pool – best practice in larger deployments for redundancy, but it’s something I have seen go wrong, with the result that the same IP address is offered to more than one client. Fixing this gets too far into the weeds and will be implementation-specific; it’s enough to know that it absolutely shouldn’t happen, and if the logs suggest it is happening, that’s a serious problem that needs someone to fix it.

The DHCP server simply stops working. Yep, this can and does happen. It’s especially a problem with some of the more affordable hardware such as ISP-provided routers. I encountered a MikroTik router being used for DHCP on a large public network; from time to time it would stop issuing IP addresses to random clients before eventually issuing no leases at all. A reboot always resolved it, and I’m sure newer firmware has fixed this. There was often a battle with the owner to get them to restart it because “it was routing traffic just fine” – and, yes, it was. It just wasn’t issuing IP address leases any more.

Why is it always DNS?

When we’re working on the Wi-Fi it can be easy to overlook some of the basic network services that are essential to client connectivity. These might be managed by someone else, or just not your bag if you’ve spent all your time learning the finer points of 802.11. So here’s the first of a few short pieces looking at these elements of building a good Wi-Fi network, focussing this time on the Domain Name System, or DNS.

Put simply, DNS is the thing that resolves a human-friendly name such as wifizoo.org into an IP address such as 178.79.163.251 (IPv4) or 2a01:7e00::f03c:91ff:fe92:c52b (IPv6).

When DNS doesn’t work, the whole network can appear to be down. No DNS means your browser can’t resolve google.com to an IP address, so it can’t send any requests. Many network clients check that a test web server can be reached to confirm internet connectivity – if its name can’t be resolved, the client will report no internet. And poor DNS performance can make the fastest network appear slow to the end user.

What I’m stressing is that DNS needs to work, and it needs to be responsive and reliable.

What I’ve seen go wrong

Right-size the server – too many clients hitting a server with insufficient resources is a bad time that you will likely only see at peak usage. A coffee shop with 20 clients is fine with the ISP router as the local DNS server; a stadium with 10,000 clients needs a caching DNS server capable of handling the peak requests per second. Know what your DNS server is and what else that box might be doing, and watch system resource utilization (CPU, memory, etc.) at peak times to understand whether the hardware is reaching capacity.
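A rough sizing sketch can make “peak requests per second” concrete. The per-client query rate below is an assumption for illustration – real rates vary hugely with browser prefetching and app chatter, so measure your own network:

```python
def peak_dns_qps(clients: int, queries_per_client_per_min: float = 30) -> float:
    """Back-of-envelope peak DNS queries/sec. The default per-client rate
    is an assumed figure; prefetch-happy browsers can push it much higher."""
    return clients * queries_per_client_per_min / 60

print(peak_dns_qps(20))      # 10.0 qps – the ISP router shrugs this off
print(peak_dns_qps(10_000))  # 5000.0 qps – needs a proper caching resolver
```

Even 5,000 qps is well within reach of a modest caching resolver, but only if the box isn’t starved of CPU or busy doing six other jobs.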

Have more than one DNS server. Like right-sizing, this depends on context. Again, a coffee shop will have a single DNS server in the form of the ISP router, which is a single point of failure no matter what. A larger network with redundant switching and routing should have at least two DNS servers issued to clients, and these should be in separate locations – you’re aiming to ensure DNS remains available in the event of a failure somewhere. I have encountered a situation where both DNS servers were VMs running in the same DC, which lost power. Someone forgot to pin the servers to separate hosts.

Public DNS server rate limiting – the network became slow and unreliable at peak times, but airtime utilization was not the problem. Say you decide to use a public DNS service such as Google’s 8.8.8.8 or Cloudflare’s 1.1.1.1 on a public Wi-Fi network that sends all outbound traffic from a single NATed public IP address. You run the risk of that DNS traffic being rate limited. I’ve seen this happen, and there is little to no documentation about the thresholds. Use either an internal server or a paid DNS service for public networks – the latter can also bring benefits such as simple filtering by content category (adult, gambling, etc) and blocking of known malware domains.

Monitor DNS on public Wi-Fi. Use something like HPE Aruba UXI or Netbeez that sits as a client on the network and runs regular tests. This provides visibility into problems like high DNS latency or outright failures, logged against a timestamp, which helps diagnose issues related to overloaded or rate-limited DNS.

Upstream servers are rubbish. Lots of complaints about a “slow” network, but throughput and latency were always fine. The issue was a poorly performing upstream DNS server which took long enough to resolve anything not in the local server cache that requests would often time out. For internal DNS servers, consider what is being used upstream. If your high-performing local DNS server forwards requests it can’t answer to a poorly performing ISP DNS server, you’ll still have a bad time.
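A quick way to spot this from a client is to time name resolution yourself. This sketch uses Python’s standard library via the system resolver; cached names return quickly, while names the cache hasn’t seen expose upstream latency:

```python
import socket
import time

def resolve_ms(hostname: str) -> float:
    """Time a name lookup through the system resolver, in milliseconds.
    Cold (uncached) names reveal how slow the upstream DNS really is."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, None)
    return (time.perf_counter() - start) * 1000

# Compare a likely-cached name against ones the cache hasn't seen;
# consistently slow cold lookups point at the upstream resolver.
print(f"localhost: {resolve_ms('localhost'):.2f} ms")
```

Run it against a handful of random, rarely visited domains during the busy period and the pattern is usually obvious.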

My personal recommendation for any public Wi-Fi solution is to use a local caching DNS server. Unbound DNS is a good option and is easily deployed on Linux. It’s also built into the OPNsense open-source firewall/router, which is an easy way to deploy it if you just want an appliance. I will keep coming back to OPNsense for other elements of this series as it’s often a great solution. The default OPNsense configuration of Unbound uses the DNS root servers to locate authoritative answers to queries; you can also forward requests to specific upstream servers.
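For a plain-Linux deployment, a minimal Unbound configuration sketch might look like the following – the addresses are illustrative placeholders, and the forward-zone section is optional (omit it and Unbound recurses from the root servers, as the OPNsense default does):

```
# /etc/unbound/unbound.conf – minimal caching resolver sketch
server:
    interface: 192.168.1.1            # example LAN address to listen on
    access-control: 192.168.1.0/24 allow   # only answer local clients

# Optional: forward queries instead of recursing from the roots
forward-zone:
    name: "."
    forward-addr: 9.9.9.9             # example upstream; choose your own
```

Restrict access-control to your client subnets so the resolver can’t be abused from the internet.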

It’s key to understand the client experience. There can be a temptation to point at hundreds or thousands of clients on the network, and plenty of data moving, as justification to dismiss user complaints of poor performance – but it might just be that DNS is letting down your excellent RF design.