When there's a big internet outage, there are two major suspects:
1. Cybercrime
2. Cloudflare
Or maybe that should be in the other order. What we know so far is that at around 18:19 UTC, or approximately 2:19 p.m. Eastern Daylight Time, there was a sudden spike in trouble reports and apparent network outages.
Downdetector makes a pretty picture out of it.
Of course, with the recent unpleasantness in the Middle East, people immediately jumped to the conclusion that it had something to do with Iran or Israel, and it was indeed somewhat suspicious that India was having a lot of trouble and Pakistan wasn't.
It looks, however, like the problem was actually option 2 — Cloudflare.
If you want to cut to the chase, it comes down to this:
(18:19 UTC, 2:19 p.m. EDT) Cloudflare engineering is investigating an issue causing Access authentication to fail. Cloudflare Zero Trust WARP connectivity is also impacted.
(19:57 UTC, 3:57 p.m. EDT) Cloudflare had a critical service called Workers KV that depended on an unnamed third party, and that third-party service went down.
(20:57 UTC, 4:57 p.m. EDT) All Cloudflare services have been restored and are now fully operational. We are moving the incident to Monitoring while we watch platform metrics to confirm sustained stability.
Now for some of the details.
You may recall a previous big outage in June 2022 that also looked momentarily like a purposeful denial-of-service attack but turned out to have been a programmer error. The cause was Cloudflare, ironically, trying to make their network protections more robust.
The problem this time appears to have been centered around Cloudflare's Zero Trust WARP connectivity manager and Access identity-based authentication products, all part of the overall Zero Trust platform.
This immediately raises the question of how a failure in one company can hit so many websites. To understand this, you need to understand what Cloudflare does.
One of the major issues on the internet just a few years ago was denial of service. Basically, any web server has limited resources it can apply to serving a website's requests: things like the number of concurrent connections, the amount of compute power it can provide, and the number of server processes it can manage. A really primitive way of attacking a website was to just write a program that kept making requests of the site you wanted to attack. The technical term for this is "banging the hell out of it."
This was relatively easy to deal with because you could identify the attacker and refuse to connect to it.
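To make that concrete, here is a minimal sketch of the "identify the attacker and refuse it" defense, written as a per-address request limiter. The class name, the limit of five requests per second, and the example address are all made up for illustration; real servers and firewalls do this in their own ways.

```python
from collections import defaultdict, deque


class PerClientLimiter:
    """Refuse a client that makes more than max_requests within `window` seconds."""

    def __init__(self, max_requests, window):
        self.max_requests = max_requests
        self.window = window
        # Client address -> timestamps of its recently allowed requests.
        self.history = defaultdict(deque)

    def allow(self, client, now):
        recent = self.history[client]
        # Forget requests that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False  # this client is banging on the server; refuse it
        recent.append(now)
        return True


# Hypothetical limit: 5 requests per second per address.
limiter = PerClientLimiter(max_requests=5, window=1.0)

# One attacker fires 100 requests within a tenth of a second.
attacker_results = [limiter.allow("203.0.113.7", now=i * 0.001) for i in range(100)]
print(sum(attacker_results))  # 5 requests get through, 95 are refused

# An ordinary visitor from a different address is unaffected.
visitor_ok = limiter.allow("198.51.100.1", now=0.05)
print(visitor_ok)  # True
```

The point is that the defense keys on the address: as long as all the abuse comes from one place, it is easy to spot and easy to turn away.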
Real old-timers will recall something called an Instalanche, when Glenn Reynolds at Instapundit would link an interesting story on a blog, and people from all over would try to read it at the same time. It happened to me when I was compiling a list of the scurrilous rumors that were circulating about Sarah Palin.
The usual methods of dealing with a "banging the hell out of it" attack didn't work, because it wasn't just one system participating in the "attack" — it was thousands or tens of thousands of systems that just all wanted to look at the same web page at the same time.
Black-hat hackers adopted the same approach: they would simulate an Instalanche by compromising a whole load of other systems and using each compromised system to make many requests. This is called a "distributed denial of service," or DDoS, attack.
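A rough sketch shows why spreading the requests around defeats a per-address limit. The numbers here (a limit of five requests per address, a server that can handle 1,000 requests, a botnet of 10,000 machines) are invented for illustration, not measurements of any real attack.

```python
from collections import Counter

MAX_PER_CLIENT = 5      # hypothetical per-address limit
SERVER_CAPACITY = 1000  # hypothetical number of requests the server can handle


def admitted(requests):
    """Count how many requests get past the per-address limit."""
    seen = Counter()
    passed = 0
    for client in requests:
        if seen[client] < MAX_PER_CLIENT:
            seen[client] += 1
            passed += 1
    return passed


# One machine making 20,000 requests.
single = ["203.0.113.7"] * 20000
# 10,000 compromised machines making only 2 requests each.
botnet = [f"bot-{n}" for n in range(10000)] * 2

single_load = admitted(single)
botnet_load = admitted(botnet)

print(single_load, single_load > SERVER_CAPACITY)  # 5 False
print(botnet_load, botnet_load > SERVER_CAPACITY)  # 20000 True
```

Both attacks send the same 20,000 requests. The single attacker is cut off almost immediately, while every bot stays politely under the limit and the server is swamped anyway.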
Cloudflare was started to provide protection against DDoS attacks, using various crafty methods. This grew into a whole suite of services that Cloudflare could provide at a cost that was sustainable even for a small website. The result has been that Cloudflare dominates the market, with some estimates saying that it handles as much as 50% of the internet.
So it appears that Cloudflare's Zero Trust WARP and Access services were failing. (No, there's no evidence that "zero trust" was chosen ironically.) As a result, internet service for that chunk of the market, potentially half of it, was disrupted.
Cloudflare has updates coming at cloudflarestatus.com, which has a tick-tock listing of their progress. The most interesting update is from 19:57 UTC (3:57 p.m. EDT) saying:
Update - Cloudflare’s critical Workers KV service went offline due to an outage of a 3rd party service that is a key dependency. As a result, certain Cloudflare products that rely on KV service to store and disseminate information are unavailable including:
Access
WARP
Browser Isolation
Browser Rendering
Durable Objects (SQLite backed Durable Objects only)
Workers KV
Realtime
Workers AI
Stream
Parts of the Cloudflare dashboard
Turnstile
AI Gateway
AutoRAG
Cloudflare engineers are working to restore services immediately. We are aware of the deep impact this outage has caused and are working with all hands on deck to restore all services as quickly as possible.
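The shape of that failure, one dependency going down and taking everything built on top of it along, can be sketched as a walk over a dependency map. This is only an illustration: the product names are taken from the update above, but the map itself is a simplified assumption (the update doesn't say how each product uses KV), the third party is unnamed, and "unrelated service" is a made-up entry for contrast.

```python
from collections import defaultdict, deque

# "X depends on Y" edges. Simplified and partly hypothetical, as noted above.
DEPENDS_ON = {
    "Workers KV": ["third-party service"],
    "Access": ["Workers KV"],
    "WARP": ["Workers KV"],
    "Turnstile": ["Workers KV"],
    "Stream": ["Workers KV"],
    "unrelated service": [],  # hypothetical: depends on nothing that failed
}


def affected_by(failed, depends_on):
    """Return every service that directly or indirectly depends on `failed`."""
    # Reverse the map: for each service, who depends on it?
    dependents = defaultdict(list)
    for service, needs in depends_on.items():
        for need in needs:
            dependents[need].append(service)

    # Breadth-first walk outward from the failed service.
    down = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in down:
                down.add(service)
                queue.append(service)
    return down


outage = affected_by("third-party service", DEPENDS_ON)
print(sorted(outage))
# ['Access', 'Stream', 'Turnstile', 'WARP', 'Workers KV']
```

None of the products at the top of the map needs to know the third party exists. They only need Workers KV, and that is enough to take them down with it.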
As of 20:57 UTC (4:57 p.m. EDT), Cloudflare says:
All Cloudflare services have been restored and are now fully operational. We are moving the incident to Monitoring while we watch platform metrics to confirm sustained stability.
So, disaster apparently averted. And the question of the wisdom of so much of the internet depending on one company can be put off to another day.