Facebook can’t be down, can it?
That was the immediate reaction from the engineers at web infrastructure firm Cloudflare when they first noticed a problem accessing the social network on October 4. In the moments after Facebook went dark, they assumed the problem must be with their own equipment, not Facebook’s. Facebook doesn’t go down. The Titanic doesn’t sink.
It soon became clear, however, that Facebook had indeed disappeared. And not only Facebook itself, but Instagram, WhatsApp and anything else in the Facebook portfolio. “It was as if someone had ‘pulled the cables’ from their data centres all at once and disconnected them from the internet,” Cloudflare stated in a report published shortly after the outage.
It was six hours before Facebook’s sites started coming back online. Six hours in which Facebook sellers couldn’t trade, advertisers couldn’t reach users, and people couldn’t reach colleagues, friends and family via the social networks. The effects were felt beyond Facebook’s sphere too: Twitter struggled to stay upright following an influx of users wondering where Facebook had gone; DNS servers were flooded with requests for the missing sites; and the many sites that use Facebook credentials to authenticate users may as well have been offline themselves. Facebook sneezed and the internet caught a cold.
That’s simply not meant to happen. We’ve been told for decades that the internet is resilient and robust, that its distributed nature means it would survive a nuclear bomb along with the cockroaches. Is that really true?
In the past year alone, three major web outages have knocked behemoths such as Amazon, Reddit, HSBC and other household names offline. Perhaps not for long, certainly nowhere near as long as Facebook was, but offline nonetheless. Far from the internet being a huge, diverse network with ridiculous amounts of redundancy built in, 2021 has shown that even the most innocuous errors can knock out the world’s biggest businesses. The internet is perhaps more fragile than we thought.
It was as if someone had ‘pulled the cables’ from their data centres all at once and disconnected them from the internet.
Facebook falls over
The most bizarre thing about Facebook’s failure was that it wasn’t caused by a cyberattack or a rogue insider, but by an in-house mistake.
Border Gateway Protocol (BGP) is a fundamental part of internet addressing. The internet is a patchwork of thousands of smaller networks and BGP is effectively the postal service that connects them all. It helps choose the fastest route to deliver packets of data from A to B. Without BGP, the big internet routers simply wouldn’t know where to send traffic and the entire internet would collapse.
Each network (such as Facebook’s) has an Autonomous System Number (ASN), and every ASN announces prefix routes for the IP addresses it controls, or the addresses it knows how to reach, using BGP. Facebook’s problems began when it sent out an erroneous BGP update that withdrew its known routes. Facebook had accidentally told the rest of the internet that its own sites didn’t exist. So, when people typed Facebook.com, nobody knew where to find its servers. Facebook’s own DNS servers had become unreachable.
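To make the mechanics concrete, here’s a minimal sketch in Python of what a route withdrawal does to a routing table. The prefixes and AS numbers below are invented for illustration – real BGP involves peering sessions between routers, not a dictionary – but the effect is the same: once the prefix is withdrawn, there is simply nowhere to send the traffic.

```python
import ipaddress

# Toy routing table: announced prefix -> the autonomous system (ASN) that
# originates it. These prefixes and ASNs are illustrative, not Facebook's.
routes = {
    ipaddress.ip_network("203.0.113.0/24"): "AS64500",
    ipaddress.ip_network("198.51.100.0/24"): "AS64501",
}

def lookup(ip: str):
    """Find where to send traffic for an address (longest-prefix match)."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in routes if addr in net]
    if not matches:
        return None  # no announced route: the destination is unreachable
    return routes[max(matches, key=lambda net: net.prefixlen)]

print(lookup("203.0.113.10"))  # AS64500: traffic has somewhere to go

# A BGP withdrawal tells peers to delete the prefix from their tables:
del routes[ipaddress.ip_network("203.0.113.0/24")]

print(lookup("203.0.113.10"))  # None: the network has vanished from the map
```

Because Facebook’s DNS servers sat inside the withdrawn prefixes, both the name lookup and the route to the servers failed at once.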
This presented a massive problem for Facebook’s engineers. “All of this happened very fast,” Santosh Janardhan, Facebook’s VP of infrastructure wrote on the company’s blog.
“And as our engineers worked to figure out what was happening and why, they faced two large obstacles: first, it was not possible to access our data centres through our normal means because their networks were down, and second, the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this.”
In the end, Facebook had to send teams into its data centres, but here they were thwarted by the company’s own security. “These facilities are designed with high levels of physical and system security in mind,” wrote Janardhan. “They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them.” All of which is why it took six painful hours to get its systems back online.
It was not possible to access our data centres through our normal means.
The ripple effect
The pain of the Facebook outage wasn’t only felt by Facebook itself. All manner of internet services hang off the social network. Many websites carry Facebook advertising beacons, which wouldn’t have worked because the Facebook domain was unavailable.
Many authenticate users with Facebook credentials to save them from operating their own user registration system – that too would have been inaccessible. Many businesses also depend on Facebook advertising to reach customers and would have taken a big hit to their trade that day.
Then there are the indirect effects of the Facebook, WhatsApp and Instagram blackouts. First, there are the people who attempt to go to Facebook’s website, find it’s not there and enter the address again, and again, and again because – like Cloudflare’s engineers – they assume something’s wrong at their end. Then there are the apps that don’t take no for an answer either, constantly retrying to ping the mothership. That surge of hundreds of millions, if not billions, of people trying to reach offline sites causes a massive spike in DNS queries. Cloudflare reported that its DNS servers were handling 30 times the usual number of queries in the aftermath of the outage.
The more pressure DNS servers are under, the longer it takes for people to reach websites, and the more likely it is that people will hit reload and send a second request, putting even more pressure on the overworked servers. The impact of that increased load can be seen clearly in the graph above, where SamKnows data shows major sites such as Google, the BBC and YouTube were taking much longer to resolve DNS queries than normal.
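The feedback loop can be sketched with some back-of-the-envelope arithmetic; the client counts and retry policies below are illustrative, not Cloudflare’s measurements.

```python
# Naive clients retry failed lookups immediately, so every failure becomes
# several extra queries and the load on the resolvers multiplies.

def total_queries(clients: int, retries_per_client: int) -> int:
    return clients * (1 + retries_per_client)  # one initial attempt + retries

baseline = total_queries(clients=1_000_000, retries_per_client=0)
outage = total_queries(clients=1_000_000, retries_per_client=4)
print(outage / baseline)  # 5.0: a fivefold spike from retries alone

# Well-behaved clients use exponential backoff instead, spacing retries
# out so the overloaded resolvers get room to recover.
def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0):
    return [min(cap, base * 2**i) for i in range(attempts)]

print(backoff_delays(5))  # [0.5, 1.0, 2.0, 4.0, 8.0] seconds between tries
```

The point of the model is simply that failure plus impatient retries generates more traffic than success does – which is why a dead site can paradoxically increase load on the infrastructure around it.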
The impact of the increased load created a spike in DNS resolution time on major websites such as Google, the BBC and YouTube
You want to make sure you gather enough data from enough vantage points around the world to confirm a global outage.
The dramatic impact of Facebook’s outage: the site went from responding perfectly normally to no response at all in an instant
Then there’s the increased load on the other social networks, where people flock to find out why Facebook, WhatsApp and Instagram aren’t working. “Hello, literally everyone,” the official Twitter account tweeted, with more than a tinge of sarcasm, when Facebook first went down. But pretty soon Twitter itself was struggling. “Sometimes more people than usual use Twitter,” the company tweeted later. “We prepare for these moments, but today things didn’t go exactly as planned. Some of you may have had an issue seeing replies and DMs as a result.”
SamKnows became aware of Facebook’s problems almost instantly. “We have specific tests for Facebook, Facebook Messenger, Instagram, WhatsApp and then lots of services within those, as well as image sharing, text sending, text retrieval and so on,” said SamKnows founder and CTO, Sam Crawford. “These run from so many devices globally – such as Whiteboxes, routers and phones – that we can effectively see an outage within seconds. We usually want to wait a minute or two to have some certainty that the outage is not localised to one particular ISP or part of the world – you want to make sure you gather enough data from enough vantage points around the world to confirm it is actually a global Facebook outage.”
One look at the graph above is enough to see how dramatic the impact of Facebook’s outage was. The site went from responding perfectly normally to SamKnows’ test pings to no response at all in an instant.
This is one reason why SamKnows is so focused on application testing rather than speed tests. A generic speed test run during this time would have reported no problems, even though one of the biggest sites in the world was down.
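A stripped-down sketch of that idea, using invented names rather than SamKnows’ real API: per-service probes run from multiple vantage points, then get aggregated to distinguish a localised fault from a global outage. A raw speed test would pass throughout.

```python
from dataclasses import dataclass

@dataclass
class Probe:
    service: str   # e.g. "facebook", "whatsapp" (illustrative names)
    vantage: str   # where in the world the probe ran
    responded: bool

def outage_scope(probes: list, service: str) -> str:
    """Classify a service by which vantage points can still reach it."""
    by_vantage: dict = {}
    for p in probes:
        if p.service == service:
            by_vantage.setdefault(p.vantage, []).append(p.responded)
    if not by_vantage:
        return "unknown"
    failing = [v for v, results in by_vantage.items() if not any(results)]
    if len(failing) == len(by_vantage):
        return "global outage"   # every vantage point fails
    return "localised fault" if failing else "healthy"

probes = [
    Probe("facebook", "london", False),
    Probe("facebook", "tokyo", False),
    Probe("facebook", "new-york", False),
    Probe("speedtest", "london", True),  # a generic speed test still passes
]
print(outage_scope(probes, "facebook"))   # global outage
print(outage_scope(probes, "speedtest"))  # healthy
```

Waiting a minute or two before declaring an outage, as Crawford describes, amounts to requiring this kind of agreement across vantage points before raising the alarm.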
The bigger BGP problem
Security experts say the one positive to emerge from Facebook’s outage is that it might finally thrust some attention on the bigger problems with BGP. Facebook’s problem was caused by an error in routine maintenance that its automated software tools failed to catch. However, BGP has bigger underlying flaws. “It’s a shame that it took the Facebook BGP-related problem to surface BGP as a topic of conversation, because much more serious things have happened with BGP beforehand that really only got covered in technology publications,” said Rik Ferguson, vice president of security research at Trend Micro.
Ferguson is more worried about vulnerabilities in the BGP system that allow web traffic to be hijacked. BGP permits any of the major internet routers to announce the networks for which it knows a route. However, that system is based entirely on trust. It’s possible for a router to announce it knows how to get traffic to, say, Google and divert that traffic elsewhere.
“The net result is that if you announce routes which you are not responsible for, you can effectively hijack that traffic and cause it to pass through your routers, do whatever you want to it – inspect it, maybe even modify it, depending on the protocol – and then pass it on to where it’s supposed to be going,” said Ferguson.
Alternatively, Ferguson adds, the traffic can be “blackholed” – sent nowhere – potentially causing outages like the one we saw with Facebook.
If you announce routes you’re not responsible for, you can effectively hijack that traffic.
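The trust problem Ferguson describes follows from a core BGP rule: routers prefer the most specific (longest) matching prefix on offer. A rough sketch, with invented prefixes and ASNs, of why a bogus, more-specific announcement wins:

```python
import ipaddress

# Legitimate state: one announced prefix. All values are illustrative.
routes = {
    ipaddress.ip_network("198.51.100.0/24"): "AS64500 (legitimate owner)",
}

def best_route(ip: str):
    """Routers prefer the longest matching prefix among all announcements."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in routes if addr in net]
    return routes[max(matches, key=lambda n: n.prefixlen)] if matches else None

print(best_route("198.51.100.7"))  # AS64500 (legitimate owner)

# Nothing in plain BGP stops another router from announcing a MORE specific
# prefix inside that range; longest-prefix match means the bogus route wins.
routes[ipaddress.ip_network("198.51.100.0/25")] = "AS64666 (hijacker)"

print(best_route("198.51.100.7"))  # AS64666 (hijacker) now gets the traffic
```

Mitigations such as RPKI route-origin validation exist precisely to let routers check whether an AS is entitled to originate a prefix, but deployment remains far from universal.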
Facebook isn’t the only company to have suffered when a configuration update went badly wrong this year. In July, Akamai, one of the world’s biggest content delivery networks (CDNs), suffered disruption when a configuration update triggered a bug in its DNS system.
The end result was that some of the world’s biggest websites and services – including HSBC, American Airlines and Steam – were down for around an hour, because once again web browsers didn’t know where to find the sites.
In this instance, the bug was relatively simple to rectify. Akamai rolled back its update and within an hour or so, normal service was resumed. But the incident once again raised questions about a single point of failure, when a fault with one CDN provider caused so many household names to temporarily vanish from the internet.
Akamai isn’t even the biggest player in town, ranking only third in terms of customer count, according to cloud intelligence firm Intricately. The top three CDN providers – Cloudflare, Amazon Web Services and Akamai – accounted for 89% of customers in 2020, Intricately’s research shows, which demonstrates just how centralised the market has become. If one of those major players has a problem, so do lots of websites.
That was demonstrated for a second time when Fastly suffered a bizarre outage of its own in June.
Fastly’s problem was caused by an undiscovered software bug, which meant that when a Fastly customer made a valid configuration change, it caused disruption across the entire Fastly network. That bug took sites such as Amazon, Reddit and the UK government website offline for three quarters of an hour, as once again a flaw in one edge cloud platform caused widespread disruption.
As with Akamai, the speed of Fastly’s response was impressive. The company claimed it spotted the disruption within a minute, then quickly identified and isolated the cause; within 49 minutes, 95% of Fastly’s network was operating as normal.
“Even though there were specific conditions that triggered this outage, we should have anticipated it,” Fastly wrote in a blog post that followed the outage, adding that it would carry out a complete post-mortem on what went wrong and re-evaluate why it didn’t catch the bug in the code.
Cause for concern?
As we’ve seen, even the biggest websites and internet services in the world aren’t immune from blackouts. But the fact that it’s worldwide headline news when a service such as Facebook goes down for even a short period shows you just how rare those outages are.
Even leaving aside that those internet giants are under relentless cyberattack, constantly probed by everything from bedroom hackers to state-sponsored hacking groups, the fact that we don’t notice more downtime is something of a miracle in itself, according to Sam Crawford.
“The thing we don’t think about is that these massive services are making thousands, tens of thousands, or possibly even hundreds of thousands of changes every day – spinning up servers, taking down servers, connecting new networks to their network, bringing new services online,” he said.
“It’s an ever-changing environment on these hyperscale services that have a million-plus servers on their network. The fact is that 99.99% of those changes are completely unnoticed by us and work perfectly. With that many changes going on every day, it’s inevitable that we’re going to see an error at some point.”
The internet is complex, centralised, fragile and yet remarkably resilient at the same time. The big sites don’t go down often, but when they do, the impact is felt far and wide.