It was another end of the workweek; what could possibly go wrong? Sure, Outlook had failed for a few hours earlier in the week and Twitter lost control of some big-name accounts, but surely nothing else could go awry? Right? Wrong. Bad things come in threes. Starting on Friday afternoon, Cloudflare, the major content delivery network (CDN) and Domain Name System (DNS) service, had a major DNS failure, and tens of millions of users found their internet services failing.
At the time, there was concern that the US internet itself was under attack. The real problem was much more mundane. Cloudflare’s technical support reported:
“This afternoon we saw an outage across some parts of our network. It was not as a result of an attack. It appears a router on our global backbone announced bad routes and caused some portions of the network to not be available. We believe we have addressed the root cause and monitoring systems for stability now.”
Like I said, whoops.
There’s an old saying in network administration circles that when something goes awry on the network, “It’s always DNS.” In this case, that’s exactly right. It was DNS.
Specifically, sites that use Cloudflare DNS hosting and anyone using Cloudflare’s free 1.1.1.1 DNS resolver service were knocked off the web for about half an hour. John Graham-Cumming, Cloudflare’s CTO, explained in a blog posting:
The outage occurred because, while working on an unrelated issue with a segment of the backbone from Newark to Chicago, our network engineering team updated the configuration on a router in Atlanta to alleviate congestion. This configuration contained an error that caused all traffic across our backbone to be sent to Atlanta. This quickly overwhelmed the Atlanta router and caused Cloudflare network locations connected to the backbone to fail.
The error consisted of a single line of configuration code, but that was more than enough. Instead of removing the Atlanta routes from the backbone, a tiny change started leaking all Border Gateway Protocol (BGP) routes into the backbone. BGP is the standardized exterior gateway protocol used to exchange routing and reachability information between the internet’s autonomous systems (ASes), the large, independently operated networks that together make up the internet. Average users never deal with these.
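To see why one bad announcement can swamp a single router, here is a deliberately simplified Python sketch. It is not Cloudflare's configuration and not real BGP best-path selection: it assumes each site simply picks the advertised next hop with the highest preference number, and that the leaked routes happened to win. The site name `SiteC`, the function names, and all preference values are hypothetical.

```python
def pick_next_hop(adverts):
    """Return the next hop with the highest preference from (next_hop, preference) pairs."""
    return max(adverts, key=lambda advert: advert[1])[0]

def traffic_per_hop(adverts_by_site):
    """Count how many sites send their traffic to each next hop."""
    counts = {}
    for adverts in adverts_by_site.values():
        hop = pick_next_hop(adverts)
        counts[hop] = counts.get(hop, 0) + 1
    return counts

# Hypothetical steady state: traffic is spread across the backbone.
normal = {
    "Newark": [("Chicago", 100), ("Atlanta", 50)],
    "Chicago": [("Newark", 100), ("Atlanta", 50)],
    "SiteC": [("Atlanta", 100), ("Newark", 50)],
}

# Assumed effect of the leak: Atlanta now advertises every route,
# with a preference (200, made up) that beats all existing routes.
leaked = {site: adverts + [("Atlanta", 200)] for site, adverts in normal.items()}

print(traffic_per_hop(normal))
print(traffic_per_hop(leaked))
```

In this toy model the first call shows traffic split three ways, while the second shows every site choosing Atlanta, which mirrors the overload described in the blog post.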
We ordinary mortals do, however, work with DNS all the time even if we’re not aware of it. DNS is the internet’s master phonebook. Whenever you type in or click a human-readable web link (such as zdnet.com), your web browser calls on a DNS resolver to resolve its corresponding Internet Protocol (IP) address.
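For the curious, that lookup step can be reproduced in a few lines of Python. This is a minimal sketch using the standard library's `socket.getaddrinfo`, which asks whatever resolver the operating system is configured to use. The helper name `resolve` is hypothetical, and the addresses you get back depend on your network and resolver.

```python
import socket

def resolve(hostname):
    """Return the sorted, de-duplicated IP addresses the system resolver gives for hostname."""
    # getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples;
    # sockaddr[0] is the IP address string for both IPv4 and IPv6 results.
    results = socket.getaddrinfo(hostname, None)
    return sorted({sockaddr[0] for *_, sockaddr in results})
```

Calling `resolve("zdnet.com")` on a connected machine returns that site's current addresses. When the resolver cannot answer, the call raises `socket.gaierror`, and that is the kind of failure applications have to cope with when DNS is down.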
DNS is not just for browsers, though. If it runs on the internet (Slack, email, you name it), DNS works behind the scenes to make sure all the application requests hook up with the appropriate internet resources. Whether it's a website, email server, or FTP site, it has an IPv4 address or its IPv6 equivalent. No single set of servers tracks them all. Instead, the 13 named DNS root servers sit at the top of a hierarchy: they point resolvers to the servers for each top-level domain, such as .com, which in turn point to the authoritative servers that hold the records for individual domains. DNS is essential. Without it, there is no internet. Period.
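Under the hood, a lookup is a chain of referrals: a resolver asks a root server which servers handle the top-level domain, asks those which server is authoritative for the domain, and only then gets the address. The Python sketch below models that chain with in-memory dictionaries. Every server name and address in it is illustrative (192.0.2.10 comes from a range reserved for documentation), and a real resolver does this over the network with caching.

```python
# Toy delegation data; all names and addresses are made up for illustration.
ROOT = {"com": "tld-com"}                                   # root: TLD -> TLD server
TLD_SERVERS = {"tld-com": {"example.com": "ns-example"}}    # TLD server: domain -> authoritative server
AUTHORITATIVE = {"ns-example": {"www.example.com": "192.0.2.10"}}  # authoritative: name -> address

def resolve_iteratively(name):
    """Follow root -> TLD -> authoritative referrals; return (address or None, servers asked)."""
    name = name.rstrip(".")
    labels = name.split(".")
    tld, domain = labels[-1], ".".join(labels[-2:])
    path = ["root"]

    tld_server = ROOT.get(tld)
    if tld_server is None:
        return None, path          # the root knows no server for this TLD
    path.append(tld_server)

    auth_server = TLD_SERVERS[tld_server].get(domain)
    if auth_server is None:
        return None, path          # the TLD server knows no such domain
    path.append(auth_server)

    return AUTHORITATIVE[auth_server].get(name), path

print(resolve_iteratively("www.example.com"))
```

If any link in that chain stops answering, everything below it becomes unreachable by name, even though the servers themselves are still running.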
And when DNS goes wrong, especially at a high level, the result, as we just saw last week, is a near-complete work stoppage. Fortunately, such failures are rare.