The Internet experienced another large-scale outage on July 17th, 2020, that prevented access to many major websites and Internet-based services for about 25 minutes. The outage was the result of a problem with Internet DNS services provided by Cloudflare, and it affected Internet services in many parts of the US, including the Seattle area.
Cloudflare, a company that is not a household name, provides worldwide Internet services such as DNS services, Content Delivery Networks, and Denial of Service protection for over 500,000 companies, many of them in the US and in the Hospital and Health Care industry. Cloudflare stressed that this was not a hacker attack, but a simple configuration error.
The outage follows similar outages over the last two years involving Cloudflare, Verizon, and T-Mobile, highlighting the highly interdependent nature of today’s Internet.
Cloudflare’s company blog page provided a brief technical overview of the outage.
“… while working on an unrelated issue with a segment of the backbone from Newark to Chicago, our network engineering team updated the configuration on a router… This configuration contained an error that caused all traffic across our backbone to be sent to Atlanta. This quickly overwhelmed the Atlanta router and caused Cloudflare network locations connected to the backbone to fail.” This simple error by a single engineering team impacted 50% of the traffic on their network.
These outages provide insight into issues that everyone in the Health Care IT and Communications communities should consider when planning the resiliency of their critical or essential services. Today’s Internet provides critical links in our IT and communications infrastructures, links we may not even consider until the service is unavailable. A simple example: internal pager services often use an email gateway interface to your paging vendor, which often sends this traffic over the Internet. As these nationwide outages demonstrate, we need to be aware of every critical link in the chain that makes up these services and plan alternate means of providing the service during a failure event. See the PUSHECS “Ideas and Concepts” page for more information on PACE planning and Resiliency Planning.