The Pulse: Cloudflare takes down half the internet – but shares a great postmortem

A database permissions change ended up knocking Cloudflare’s proxy offline. Pinpointing the root cause was tricky – but Cloudflare shared a detailed postmortem. Also: announcing The Pragmatic Summit.

Before we start: I’m excited to share something new: The Pragmatic Summit. Four years ago, The Pragmatic Engineer started as a small newsletter: me writing about topics relevant to engineers and engineering leaders at Big Tech and startups. Fast forward to today: the newsletter has crossed one million readers, and the publication has expanded with a podcast as well. One thing that was always missing: meeting in person. Engineers, leaders, founders – people who want to meet others in this community and learn from each other. Until now, that is:
In partnership with Statsig, I’m hosting the first-ever Pragmatic Summit. Seats are limited, and tickets are priced at $499, covering the venue, meals, and production – we’re not aiming to make any profit from this event. I hope to see many of you there!

Cloudflare takes down half the internet – but shares a great postmortem

On Tuesday came another reminder of how much of the internet depends on Cloudflare’s content delivery network (CDN), when thousands of sites went fully or partially offline in an outage that lasted 6 hours. Some of the higher-profile victims included:
Separately, you may or may not recall that during a different recent outage caused by AWS, Elon Musk noted on his website, X, that AWS is a hard dependency for Signal, meaning an AWS outage could take down the secure messaging service at any moment. In response, a dev pointed out that it is the same for X with Cloudflare – and so it proved earlier this week, when X was broken by the Cloudflare outage.
That AWS outage was in the company’s us-east-1 region and took down a good part of the internet last month. AWS released incident details three days later – unusually speedy for the e-commerce giant – although that postmortem was high-level, and we never learned exactly what caused AWS’s DNS Enactor service to slow down and trigger the unexpected race condition that kicked off the outage.

What happened this time with Cloudflare?

Within hours of mitigating the outage, Cloudflare’s CEO Matthew Prince shared an unusually detailed report of exactly what went wrong. The root cause came down to a configuration file propagated to Cloudflare’s Bot Management module: the file crashed Bot Management, which took Cloudflare’s proxy functionality offline. Here’s a high-level overview of how Cloudflare’s proxy layer works. It’s the layer that protects customers’ “origin” resources – minimizing network traffic to them by blocking malicious requests and caching static resources in Cloudflare’s CDN:
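To make that flow concrete, here’s a minimal, illustrative sketch of the decisions such a proxy layer makes per request. This is not Cloudflare’s code: the function names (handle_request, bot_score, fetch_origin), the scoring threshold, and the in-memory cache are all assumptions for illustration.

```python
# Illustrative sketch of a CDN proxy's per-request flow: block likely bots,
# serve cached static assets, otherwise forward to the customer's origin.
# All names and values here are hypothetical, not Cloudflare's implementation.
from dataclasses import dataclass

CACHE: dict[str, bytes] = {}   # simplified stand-in for the CDN's edge cache
BOT_SCORE_THRESHOLD = 30       # hypothetical cut-off: lower score = more bot-like

@dataclass
class Request:
    url: str
    headers: dict[str, str]

def bot_score(req: Request) -> int:
    """Stand-in for a Bot Management module: returns 0-99, low = likely bot."""
    # Real bot management uses ML-derived features; this just checks the user agent.
    return 5 if "bot" in req.headers.get("User-Agent", "").lower() else 90

def fetch_origin(url: str) -> bytes:
    """Stand-in for forwarding the request to the customer's origin server."""
    return b"<html>origin response</html>"

def handle_request(req: Request) -> bytes:
    # 1. Block requests that the bot-management layer scores as malicious.
    if bot_score(req) < BOT_SCORE_THRESHOLD:
        return b"403 Forbidden"
    # 2. Serve cached static resources without touching the origin.
    if req.url in CACHE:
        return CACHE[req.url]
    # 3. Otherwise proxy to the origin and cache the response for next time.
    body = fetch_origin(req.url)
    CACHE[req.url] = body
    return body
```

In this sketch, bot filtering sits in front of both the cache and the origin path – which echoes the incident above, where a crash in the Bot Management module took the proxy functionality down with it.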
Here’s how the incident unfolded: a database permissions change in ClickHouse kicked things off. Before the permissions changed, all queries that fetched feature metadata (used by the Bot Management module) ran only against distributed tables in ClickHouse, in a database called “default” which contains 60 features.
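For a rough idea of what a feature-metadata query like this can look like, here is a hedged sketch using the clickhouse-connect Python client. The connection details, the table name (http_requests_features), and the exact filters are assumptions for illustration, not the precise query from Cloudflare’s postmortem.

```python
# Illustrative only: fetch feature (column) metadata from ClickHouse's
# system.columns table, scoped to the "default" database described above.
# Host, table name, and filters are assumptions, not Cloudflare's real query.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")  # hypothetical connection

result = client.query(
    """
    SELECT name, type
    FROM system.columns
    WHERE database = 'default'                 -- only the "default" database
      AND table = 'http_requests_features'     -- hypothetical feature table
    ORDER BY name
    """
)

features = result.result_rows   # list of (column_name, column_type) tuples
print(f"Fetched metadata for {len(features)} features")  # ~60 per the description above
```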