10.5.3 Fixes DNS Problems Plaguing Some Leopard Users
June 3, 2008
When Leopard showed up last fall, I began receiving reports of network problems from users of my Wx weather app. The common theme was a timeout during name lookup. All in all, I'd say about 10-15% of users were affected by this, which is a lot when there are a few thousand users and only one developer to handle tech support.
I had tangled with name lookup issues during the early stages of Leopard development, and learned that Leopard changed the way name lookups were made. In OS X 10.4 and earlier, lookups requested a very simple "A" (address) record from a domain name system (DNS) server. In this type of lookup, the DNS server returns a 32-bit IPv4 address, which tells your computer what numerical IP address (like 18.104.22.168) is associated with a host name (like www.example.com). The A record lookup is one of the simplest (and most important) operations that make the internet run. It's sort of like dialing the operator and asking for the phone number of a business you want to reach.
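To make the "simple A record" request concrete, here is a minimal Python sketch of what such a query looks like on the wire, following the message layout in RFC 1035. The function names, the query ID, and the host name are placeholders for illustration; the sketch only builds the bytes and decodes an A answer's 4-byte payload, it doesn't send anything. Passing `qtype=33` would ask for an SRV record instead.

```python
import struct

QTYPE_A = 1    # "A" record: one IPv4 host address (RFC 1035)
QCLASS_IN = 1  # the Internet class

def encode_name(hostname: str) -> bytes:
    """Encode a host name as length-prefixed labels followed by a zero byte."""
    out = b""
    for label in hostname.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) < 64:
            raise ValueError(f"invalid label: {label!r}")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"

def build_query(hostname: str, qtype: int = QTYPE_A, query_id: int = 0x1234) -> bytes:
    """Build a one-question DNS query asking the server to recurse for us."""
    # Header fields: ID, flags (only RD, "recursion desired", set),
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    question = encode_name(hostname) + struct.pack("!HH", qtype, QCLASS_IN)
    return header + question

def parse_a_rdata(rdata: bytes) -> str:
    """The data in an A record answer is exactly 4 bytes: a 32-bit IPv4 address."""
    if len(rdata) != 4:
        raise ValueError("A record data must be 4 bytes")
    return ".".join(str(b) for b in rdata)

query = build_query("www.example.com")
print(len(query), query.hex())
print(parse_a_rdata(bytes([18, 104, 22, 168])))  # 18.104.22.168
```

The whole request is a 12-byte header plus one question, which is part of why A lookups are so cheap for servers to answer.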
In early versions of 10.5, name lookups were changed to request an "SRV" (service locator) record. This is a more comprehensive type of record: rather than just mapping a name to an IP address, it maps a named service within a domain (for instance, multiple/mirrored web servers, e-mail servers, LDAP servers, etc.) to the host names and port numbers that provide it, along with priorities and weights for choosing among them. Extending the phone analogy above, it would be like having the operator give you the phone number for a business *and* the extension for the specific department you want to reach.
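Per RFC 2782, an SRV record is looked up under a name like `_ldap._tcp.example.com` and its data holds a priority, a weight, a port, and a target host name. Below is a minimal Python sketch of decoding that data and picking a server: lowest priority first, with ties split by weight. `SrvRecord`, `parse_srv_rdata`, and `pick_srv` are hypothetical helpers, not anything from OS X; the weighted pick uses Python's `random.choices` rather than reproducing the RFC's exact running-sum procedure, and the parser assumes an uncompressed target name.

```python
import random
import struct
from typing import List, NamedTuple

class SrvRecord(NamedTuple):
    priority: int  # lower value = try first
    weight: int    # relative share among records with the same priority
    port: int
    target: str    # host name providing the service

def parse_srv_rdata(rdata: bytes) -> SrvRecord:
    """Decode SRV data: three 16-bit fields, then an (uncompressed) target name."""
    priority, weight, port = struct.unpack_from("!HHH", rdata, 0)
    labels, i = [], 6
    while rdata[i] != 0:
        length = rdata[i]
        labels.append(rdata[i + 1:i + 1 + length].decode("ascii"))
        i += 1 + length
    return SrvRecord(priority, weight, port, ".".join(labels))

def pick_srv(records: List[SrvRecord], rng: random.Random) -> SrvRecord:
    """Choose a server: lowest priority wins, ties split in proportion to weight."""
    if not records:
        raise ValueError("no SRV records")
    best = min(r.priority for r in records)
    group = [r for r in records if r.priority == best]
    weights = [r.weight for r in group]
    if sum(weights) == 0:
        return rng.choice(group)  # all weights zero: pick uniformly
    return rng.choices(group, weights=weights, k=1)[0]

records = [
    SrvRecord(10, 60, 389, "ldap1.example.com"),
    SrvRecord(10, 40, 389, "ldap2.example.com"),
    SrvRecord(20, 0, 389, "backup.example.com"),  # pick_srv skips it while priority 10 exists
]
print(pick_srv(records, random.Random()).target)
```

With these example records, roughly 60% of picks go to `ldap1` and 40% to `ldap2`, which is the "extension for the specific department" part of the analogy.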
Though the SRV record makes a lot of sense and has been a recommended standard for over 8 years, there are still DNS servers on the internet that don't support it! When early versions of 10.5 requested an SRV record from these servers, some were smart enough to reply with a "can't do it" (prompting the OS to request an A record instead), but others didn't respond at all. So the OS would wait 30 seconds (the default timeout for name lookups, unless overridden by apps) and then try again. After a few SRV failures, the OS would revert to an A record request. But by then most users had either cancelled the request or fallen asleep.
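The arithmetic behind that wait is easy to model. Here is a toy Python sketch of the fallback described above, not Apple's resolver code: the 30-second timeout comes from the text, while `srv_attempts=3` is an assumed stand-in for "a few", and `query`, `NotSupported`, and the two fake servers are hypothetical. Nothing actually sleeps; each timeout is just tallied.

```python
from typing import Callable, Tuple

class NotSupported(Exception):
    """The server replied, but refused this record type (the polite "can't do it")."""

Query = Callable[[str, float], str]  # hypothetical: (record type, timeout) -> answer

def resolve_with_fallback(query: Query, timeout: float = 30.0,
                          srv_attempts: int = 3) -> Tuple[str, float]:
    """Ask for SRV first, then fall back to A.

    Returns the answer plus the seconds a user would have spent staring at
    timeouts. Nothing actually sleeps; each timeout is just tallied.
    """
    lost = 0.0
    for _ in range(srv_attempts):
        try:
            return query("SRV", timeout), lost
        except NotSupported:
            break              # polite server: go straight to A
        except TimeoutError:
            lost += timeout    # silent server: the whole timeout is wasted
    return query("A", timeout), lost

def polite_server(qtype: str, timeout: float) -> str:
    if qtype == "SRV":
        raise NotSupported(qtype)
    return "18.104.22.168"

def silent_server(qtype: str, timeout: float) -> str:
    if qtype == "SRV":
        raise TimeoutError("no reply")  # stands in for waiting out the timeout
    return "18.104.22.168"

print(resolve_with_fallback(polite_server))  # ('18.104.22.168', 0.0)
print(resolve_with_fallback(silent_server))  # ('18.104.22.168', 90.0)
```

Both servers end up giving the same answer; the only difference is whether the user waits zero seconds or a minute and a half for it.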
In some cases, the OS would cache the result of the failed SRV request, so future requests avoided the problem. And apps like Safari mitigated the issue by making multiple DNS requests in parallel, increasing the chances of success. But for many users, network access slowed to a crawl and timeouts were common. Some users reported raising the network timeout in Wx as high as 90 seconds (the maximum allowed) and still seeing issues on a regular basis.
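Why parallel requests help: if several lookups are in flight at once, one stuck server only costs time if *every* attempt is stuck. This is a minimal Python sketch of racing lookups and taking the first success. It is not Safari's implementation, and the three fake lookup functions are hypothetical stand-ins for real queries.

```python
import concurrent.futures as cf
import time
from typing import Callable, List

def first_success(lookups: List[Callable[[], str]], timeout: float = 30.0) -> str:
    """Start every lookup at once; return the first answer that arrives without error."""
    if not lookups:
        raise ValueError("no lookups to run")
    pool = cf.ThreadPoolExecutor(max_workers=len(lookups))
    try:
        futures = [pool.submit(fn) for fn in lookups]
        # as_completed yields futures as they finish, fastest first; it raises
        # a TimeoutError if the deadline passes while lookups are still outstanding.
        for fut in cf.as_completed(futures, timeout=timeout):
            if fut.exception() is None:
                return fut.result()
        raise LookupError("every lookup failed")
    finally:
        # Don't wait for stragglers; a lookup already running can't be
        # interrupted, but its result is simply ignored.
        pool.shutdown(wait=False, cancel_futures=True)

def stuck_lookup() -> str:
    time.sleep(0.5)  # a server that never answers usefully
    raise TimeoutError("no reply")

def refused_lookup() -> str:
    raise OSError("server refused the query")

def quick_lookup() -> str:
    time.sleep(0.05)
    return "18.104.22.168"

print(first_success([stuck_lookup, refused_lookup, quick_lookup], timeout=5))  # 18.104.22.168
```

The trade-off is extra traffic to DNS servers in exchange for lower latency, which is reasonable for a browser but doesn't fix the underlying server problem.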
Now, imagine the tech support headaches this caused. Some developers handled it at a low level by overriding the default behavior of the "curl" library used for HTTP operations (getting curl to use the "gethostbyname" function instead of the more versatile "getaddrinfo"). Since Wx accessed curl one layer up, by running it as a separate UNIX process, this was not a good option for me without dramatically changing the way Wx does HTTP operations. So I slogged through it case by case, treating the problem as a DNS configuration issue on the user's side.
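For a feel of the two interfaces mentioned above, Python's `socket` module offers functions with the same names. This sketch only shows the difference in API shape: `gethostbyname` hands back a single IPv4 address, while `getaddrinfo` is general-purpose and here is explicitly limited to IPv4. Don't assume these map one-to-one onto the C calls curl makes (CPython's `gethostbyname` is not a thin wrapper), and `ipv4_addresses` is a hypothetical helper. Numeric addresses are used so no DNS traffic is needed.

```python
import socket
from typing import List

def ipv4_addresses(host: str, port: int = 80) -> List[str]:
    """getaddrinfo restricted to IPv4 (AF_INET); returns unique addresses in order."""
    infos = socket.getaddrinfo(host, port, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    addresses: List[str] = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses

# Older interface: returns one IPv4 address as a string.
print(socket.gethostbyname("127.0.0.1"))
# Newer interface: can return several families and socket types; here limited to IPv4.
print(ipv4_addresses("127.0.0.1"))
```

Swap in a real host name to see both calls go through the system resolver.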
In some cases, this worked great. I can't tell you how many users discovered old, outdated, or out-of-service DNS settings on their Macs or network gear. Some people had been carrying DNS baggage for years without realizing it, and it had likely been hampering name lookups at some level for a long time, even before 10.5 came along.
In other cases, it was a major challenge. The average user doesn't know (and probably shouldn't need to know) where their DNS settings are, or what they do. And then consider the common setup of a Mac connected to an AirPort connected to a cable modem, with various DHCP leases sprinkled into the mix. In most of those cases, DNS settings are provided far upstream and inherited down the line. The user may see *no* DNS settings on the Mac, or the Mac might be using the address of the AirPort router. The AirPort might hold DNS settings inherited from the cable modem, which picked them up from the ISP through its DHCP lease. The issue then becomes figuring out where to purge old DNS settings, if that's even possible, and replacing or overriding them with known good ones (such as those from OpenDNS.org). Sometimes even convincing the user they had a DNS issue was a challenge!
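One starting point for untangling that chain is to see which servers the Mac is actually using right now. On Unix-style systems, including OS X, they appear as `nameserver` lines in /etc/resolv.conf (on OS X, `scutil --dns` gives a fuller view). This Python sketch parses that format and flags private addresses, which often mean a local router such as an AirPort is relaying DNS servers it got from upstream. `nameservers` is a hypothetical helper, and the sample addresses are illustrative: 10.0.1.1 stands in for a home router, and 208.67.222.222 is one of OpenDNS's public resolvers.

```python
import ipaddress
from typing import List, Tuple

def nameservers(resolv_conf: str) -> List[Tuple[str, bool]]:
    """List (address, is_private) for each 'nameserver' line in resolv.conf-style text."""
    servers = []
    for line in resolv_conf.splitlines():
        for comment_char in "#;":
            line = line.split(comment_char, 1)[0]
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            try:
                addr = ipaddress.ip_address(parts[1])
            except ValueError:
                continue  # skip entries that aren't plain IP addresses
            # A private address often means a local router relaying DNS
            # that it got from further upstream.
            servers.append((str(addr), addr.is_private))
    return servers

sample = """# example resolver configuration
nameserver 10.0.1.1
nameserver 208.67.222.222
"""
for addr, private in nameservers(sample):
    print(addr, "private (likely a local router)" if private else "public")
```

On a real machine, `nameservers(open("/etc/resolv.conf").read())` shows the current list; a private entry is the cue to go look at the router's settings next.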
Anyway, 10.5.3 showed up last week, and I got an e-mail from a user who claimed it had miraculously solved all his DNS problems. I scoured the release notes and developer notes, but found no mention of changes to name lookups. So I updated one of my systems to 10.5.3 and did some snooping with "tcpdump" on port 53. Lo and behold, OS X now starts name lookups with A record requests again, just like 10.4 and earlier! I'd like to think that 10.5.3 fixed the DNS issue, but what it really did was revert to the old behavior. Maybe Apple can take another shot at SRV requests in 8 more years; perhaps by then, DNS servers will support this standard more widely.
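tcpdump's own decoded output already shows the query type, but if you have the raw DNS payload of a captured packet, checking which record type was requested takes only a few lines. This is a minimal Python sketch assuming you've already stripped the IP and UDP headers; `question_types` is a hypothetical helper, and it deliberately doesn't follow compressed name pointers, which a single-question query normally doesn't need.

```python
import struct
from typing import List, Tuple

# A few common query type codes (RFC 1035; RFC 3596 for AAAA; RFC 2782 for SRV).
QTYPE_NAMES = {1: "A", 5: "CNAME", 12: "PTR", 15: "MX", 28: "AAAA", 33: "SRV"}

def question_types(message: bytes) -> List[Tuple[str, str]]:
    """Return (name, record type) for each question in a raw DNS message."""
    qdcount = struct.unpack_from("!H", message, 4)[0]  # QDCOUNT is header bytes 4-5
    offset = 12  # the question section starts right after the 12-byte header
    questions = []
    for _ in range(qdcount):
        labels = []
        while True:
            length = message[offset]
            offset += 1
            if length == 0:
                break  # zero-length label ends the name
            if length & 0xC0:
                raise ValueError("compressed name pointer; this sketch doesn't follow them")
            labels.append(message[offset:offset + length].decode("ascii"))
            offset += length
        qtype, _qclass = struct.unpack_from("!HH", message, offset)
        offset += 4
        questions.append((".".join(labels), QTYPE_NAMES.get(qtype, f"TYPE{qtype}")))
    return questions

header = struct.pack("!HHHHHH", 0xBEEF, 0x0100, 1, 0, 0, 0)
a_query = header + b"\x03www\x07example\x03com\x00" + struct.pack("!HH", 1, 1)
srv_query = header + b"\x05_ldap\x04_tcp\x07example\x03com\x00" + struct.pack("!HH", 33, 1)
print(question_types(a_query))    # [('www.example.com', 'A')]
print(question_types(srv_query))  # [('_ldap._tcp.example.com', 'SRV')]
```

Running captured queries through this before and after an update is a quick way to confirm whether the first request is an A or an SRV.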