FACEBOOK DOWN - PART 2
On Monday, October 4, 2021, a large minority of people on the planet lost their primary access to the Internet. Facebook (now Meta), along with all of the 80 other companies under its umbrella, stopped loading new content and essentially went dark for about six hours.
In Facebook Down Part 1, we discussed the scope and reach of this digital event as well as the cost to both people and Wall Street.
Facebook going down for six hours was merely the culmination of an unbelievably bad week for Mark Zuckerberg and the collection of companies under his corporate umbrella, now named Meta.
What happened is the direct result of a flawed process, regardless of the technology involved, but let us take a closer look at the tech side of the equation.
As reported, the outage occurred because Facebook (Meta) mistakenly used Border Gateway Protocol (BGP) to withdraw the routes that tell the rest of the Internet how to reach the IP addresses of the servers hosting its vast empire.
How Did This Happen?
For all its complexity, the Internet is often described as “a network of networks.”
So, how to keep track of this ever growing, constantly evolving volume of digital traffic? Figuratively speaking, maps. Because there are lots of different internet service providers, backbone routers, and servers responsible for your data making it to a specific website and back, there are tens of thousands of different routes that data can take to reach its destination.
When you pay an internet service provider (ISP) to hook your home up to the Internet, they assign you an IP address from their own pool of registered addresses. They have already announced those address blocks to other networks through BGP, so when data is sent to your specific IP address, it gets routed to your ISP, which then looks you up in its own internal location tables and routes the traffic to your specific location.
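The lookup described above can be sketched in a few lines of Python. This is only an illustration, not how any real router is implemented: the `RouteTable` class, the next-hop names, and the address blocks (taken from ranges reserved for documentation) are all assumptions made for the example. It shows that the most specific announced block wins, and that once every announcement covering an address is withdrawn, there is simply nowhere to send the traffic.

```python
import ipaddress


class RouteTable:
    """Toy routing table (hypothetical): maps announced blocks to a next hop."""

    def __init__(self):
        self._routes = {}  # ip_network -> next-hop label

    def announce(self, prefix, next_hop):
        # Record that next_hop can deliver traffic for this address block.
        self._routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        # Remove the announcement; silently ignore unknown prefixes.
        self._routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        # Choose the most specific (longest) announced block containing the address.
        addr = ipaddress.ip_address(address)
        matches = [net for net in self._routes if addr in net]
        if not matches:
            return None  # no route: the destination is unreachable
        return self._routes[max(matches, key=lambda net: net.prefixlen)]


table = RouteTable()
table.announce("203.0.113.0/24", "example-isp")
table.announce("203.0.113.128/25", "example-isp-regional-site")

before = table.lookup("203.0.113.200")  # the more specific /25 wins

# Withdraw every announcement that covers the address.
table.withdraw("203.0.113.0/24")
table.withdraw("203.0.113.128/25")
after = table.lookup("203.0.113.200")  # nothing left to match

print(before, after)
```

The servers behind those addresses can be perfectly healthy in the second case; without an announced route, the rest of the network has no way to find them.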
If you are a small business, your experience is likely to be similar to a home account. If you are a corporation up to midsize, your own IT team is typically responsible for this activity.
Larger organizations, or those with hosting platforms, large scale, or dispersed IT holdings, have dedicated staff for BGP.
BGP does not make policies that guide the routing of traffic to a specific address, but it does contain the information that helps those policies make the decisions on the best way to route traffic from one point to another. Here is a YouTube video on how BGP works.
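To make the split between routing information and routing policy concrete, here is a heavily simplified sketch. The `Route` type and `best_route` function are hypothetical names, the AS numbers come from the range reserved for documentation, and the two-step rule (prefer the highest local preference, then the shortest AS path) covers only the first steps of the much longer selection process real routers apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """One candidate path to a prefix, as learned from a neighbor (illustrative)."""
    prefix: str
    as_path: tuple          # networks the traffic would cross, in order
    local_pref: int = 100   # operator-assigned preference; higher is better


def best_route(candidates):
    # Simplified policy: highest local preference first, then shortest AS path.
    # Returns None when no candidate routes exist.
    return max(
        candidates,
        key=lambda r: (r.local_pref, -len(r.as_path)),
        default=None,
    )


short_path = Route("198.51.100.0/24", as_path=(64496, 64499))
long_path = Route("198.51.100.0/24", as_path=(64497, 64498, 64499))
preferred_long = Route("198.51.100.0/24", as_path=(64497, 64498, 64499), local_pref=200)

by_length = best_route([long_path, short_path])       # shorter path wins
by_policy = best_route([short_path, preferred_long])  # policy overrides length
```

BGP supplies the candidate paths; the preference values are where an operator's own policy enters the decision.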
BGP Itself Was Not The Problem
The problem that brought Facebook and all 80+ organizations under the Meta umbrella offline for six hours was not a technical issue; it was a people and process failure.
The process of updating the BGP records for a company is straightforward, but the complexity grows quickly with each set of IP addresses added. Once added, they can be updated at will. Larger organizations can update these records daily, or even more often.
For a company the size and intricacy of Facebook (Meta), with a reach as far as it has and service points distributed across the globe, the complexity factor is huge. With tens of thousands of IP addresses, plus hundreds of staffers tasked with keeping all this computational sprawl running smoothly, it is easy to see how mistakes could happen.
So why was there no contingency plan in place? And even if there had been a “break glass in case of emergency” protocol to look to for guidance, where was the oversight of the original procedure?
Considering that Facebook (Meta) is an immense organization, updates to the BGP records for its vast holdings must be a frequent occurrence. A process that may be performed many times a day should be well-rehearsed, well-understood, and automated. Automation is all about removing opportunities for human error, but even with scripts and executables, a human is still required to push the button.
Human Oversight Is Always A Requirement
The change control process for BGP updates (as well as most other protocols) at any organization, large or small, should always include a review by a second person or a change control board. Within the department responsible for Facebook's (Meta) BGP updates, a review of the change before it was executed could have prevented the disaster.
Having an approval process for standardized protocols is not a luxury; it is a necessity. Sadly, there can be a disdain for following processes because they seem mundane. Equally, there is a lack of appreciation for the impact of the Facebook (Meta) outage. Here is a Twitter post from Nick Merrill, a cybersecurity researcher at UC Berkeley:
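A review step like this can be partly enforced in tooling. The sketch below is purely hypothetical and is not a description of Facebook's (Meta) actual systems: the `review_change` function, its parameters, and the 20% threshold are assumptions chosen for illustration. It blocks a change when nobody other than the author has approved it, or when it would withdraw an unusually large share of the prefixes currently announced.

```python
def review_change(current, proposed, author, approver, max_withdraw_fraction=0.2):
    """Return a list of reasons to block the change; an empty list means proceed.

    Hypothetical pre-flight check. The threshold is an arbitrary example value.
    """
    current, proposed = set(current), set(proposed)
    problems = []

    # Rule 1: someone other than the author must sign off.
    if not approver or approver == author:
        problems.append("change needs approval from someone other than the author")

    # Rule 2: flag changes that withdraw a large share of announced prefixes.
    withdrawn = current - proposed
    if current and len(withdrawn) / len(current) > max_withdraw_fraction:
        problems.append(
            f"change withdraws {len(withdrawn)} of {len(current)} announced prefixes"
        )

    return problems


announced = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]

# A self-approved change that withdraws everything is blocked on both rules.
risky = review_change(announced, [], author="alice", approver="alice")

# A reviewed change that only adds a prefix passes.
routine = review_change(
    announced, announced + ["203.0.114.0/24"], author="alice", approver="bob"
)
```

A check of this kind does not replace human review. It only ensures that the riskiest changes cannot go out without one.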
as far as structural outages go, this FB thing is relatively minor. a few hundreds of millions of dollars lost at most. if a similar outage hit AWS, cloudflare, akamai, etc, no one's credit card would work. i expect the losses would be in the billions if not trillions.
Mr. Merrill’s financial guess about Facebook’s (Meta) loss was off by a factor of ten. He also underestimated the losses to those outside the Facebook (Meta) organization, because he did not account for the interconnections and commerce dependencies Facebook (Meta) has with the general public. However, Merrill is completely correct about the other companies he listed. Should his second guess prove correct, we should all consider the potential outcome.
Akamai was in fact hit with an outage in 2021. While technically different from the Facebook (Meta) outage, its impact was felt across all industries, which is closer to what Merrill described. Even so, Akamai had everything back up in about an hour. They had a proper emergency procedure for such an occurrence.
When internal processes are ignored, or teams become lax about following them, you get the mistakes of Facebook (Meta) and Akamai.
Whether you control this process for your company or you outsource it, you should ensure that there is not only a review of the process before implementation, but also a contingency plan for failure on anyone’s part. Testing and reviewing all processes is a must. ConaLogix can help you determine the strategies and procedures to put in place to ensure business continuity.