Cloudflare Outage: Massive Global Internet Disruption and Successful Recovery

An in-depth overview of the November 18, 2025, Cloudflare outage: the latent configuration bug that triggered it, the widespread impact on global internet services, Cloudflare's swift recovery, and an educational look at the technical causes and the best practices that help prevent such incidents.

On November 18, 2025, Cloudflare, one of the world’s largest providers of internet security and performance infrastructure, experienced a major outage that affected hundreds of millions of internet users worldwide. Popular services such as X (formerly Twitter), ChatGPT, streaming platforms, and cryptocurrency services became temporarily inaccessible. It marked one of the most significant disruptions Cloudflare has faced since 2019.

Incident Timeline and Root Cause

The disruption began at approximately 11:20 UTC, when Cloudflare’s network began failing to deliver core HTTP traffic, returning widespread 5xx error pages to users trying to reach sites protected by Cloudflare.

The root cause was not a cyberattack or malicious activity but a latent bug triggered by a configuration change in one of Cloudflare’s database systems—specifically the ClickHouse database used by its Bot Management system. The change caused a database query to emit many duplicate entries into a "feature file," which grew to roughly twice its usual size.

Cloudflare’s core proxy software, which routes customer traffic, enforces a fixed-size limit on this feature file. When the oversized file was propagated across Cloudflare’s global network, the proxy software failed, crashing and returning widespread HTTP 5xx errors.
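
To make the failure mode concrete, here is a minimal Python sketch of the general pattern described above: a loader that enforces a hard size limit on a propagated feature file. It is not Cloudflare's proxy code; the file name, format, size limit, and fall-back to a last known-good copy are illustrative assumptions.

```python
# Illustrative sketch only: not Cloudflare's proxy code. The file name,
# size limit, and fallback behaviour are assumptions used to show the
# general pattern of a hard size limit on a propagated config file.

import tempfile
from pathlib import Path

MAX_FEATURE_FILE_BYTES = 1_000_000  # hypothetical hard limit preallocated by the consumer


class FeatureFileTooLarge(Exception):
    """Raised when a propagated feature file exceeds the fixed limit."""


def load_feature_file(path: Path, last_good: dict | None = None) -> dict:
    """Load a simple key = value feature file, failing safe if it is oversized.

    If the new file breaches the hard limit (as the doubled feature file did),
    the loader keeps serving the previous configuration instead of crashing
    the process that handles live traffic.
    """
    if path.stat().st_size > MAX_FEATURE_FILE_BYTES:
        if last_good is not None:
            return last_good  # degrade gracefully rather than fail all requests
        raise FeatureFileTooLarge(f"{path} exceeds {MAX_FEATURE_FILE_BYTES} bytes")

    features: dict[str, str] = {}
    for line in path.read_text().splitlines():
        if line.strip():
            name, _, value = line.partition("=")
            features[name.strip()] = value.strip()
    return features


if __name__ == "__main__":
    # Self-contained demo with a small, valid file in a temporary directory.
    demo = Path(tempfile.mkdtemp()) / "features.conf"
    demo.write_text("bot_score_model = v5\nja3_fingerprint = enabled\n")
    print(load_feature_file(demo))
```

The key design point is that a consumer with a fixed capacity should reject or fall back on oversized input explicitly, rather than letting the overflow surface as a crash in the traffic-serving path.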


Impact of the Outage

Several key Cloudflare services were affected:

  • Content Delivery Network (CDN) and security protections returned 5xx errors, disrupting web traffic globally.
  • Authentication services, such as Cloudflare Access, failed, preventing many users from logging in.
  • Workers KV, a key-value store system, experienced elevated error rates due to dependence on the failing core proxy.
  • Cloudflare’s own Dashboard and Turnstile login flow were also temporarily inaccessible due to these backend failures.

The result was a degraded and intermittently unavailable internet experience that lasted several hours.

Recovery Efforts and Resolution

After investigation, Cloudflare identified the oversized configuration file as the source of the failures. By 14:30 UTC, engineers had halted the generation and propagation of the bad file and replaced it with a previously known-good version.

The affected core proxy services were restarted to clear out corrupted state and reload the corrected configuration file. Core traffic began flowing normally again by 14:30 UTC, and all systems were fully restored by 17:06 UTC.
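
The recovery amounts to a classic rollback: stop propagating the bad artifact, restore the last known-good version, and restart consumers so they drop corrupted in-memory state. The Python sketch below shows that pattern under assumed file paths and a stand-in reload callback; it is not Cloudflare's internal tooling.

```python
# Illustrative sketch only: a generic "roll back to the last known-good
# config and reload" routine, not Cloudflare's internal tooling. The
# paths and reload callback are assumptions for demonstration.

import shutil
import tempfile
from pathlib import Path


def roll_back_and_reload(active: Path, known_good: Path, reload_proxy) -> None:
    """Replace the active config with the last known-good copy, then reload.

    `reload_proxy` stands in for whatever restarts or signals the proxy so
    it drops corrupted in-memory state and re-reads the corrected file,
    mirroring the restart step described in the recovery above.
    """
    shutil.copy2(known_good, active)
    reload_proxy()


if __name__ == "__main__":
    # Self-contained demo in a temporary directory.
    workdir = Path(tempfile.mkdtemp())
    active = workdir / "features.conf"
    known_good = workdir / "features.conf.good"
    active.write_text("bad oversized config\n")
    known_good.write_text("last validated config\n")

    roll_back_and_reload(
        active, known_good,
        lambda: print("proxy reloaded with:", active.read_text().strip()),
    )
```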

Cloudflare’s leadership issued a formal apology, acknowledging the critical nature of their infrastructure and pledging measures to prevent recurrence.


Causes and Prevention of Such Outages

Outages like this often result from technical conditions such as:

  1. Inadequately Tested Changes
    Software updates, configuration changes, or database queries deployed without exhaustive testing may introduce unexpected bugs that only surface in production.
  2. Exceeding System Limits
    Systems have resource limits such as file size, memory use, or CPU. Breaching these limits, as happened with the oversized feature file, can cause crashes.
  3. Unfiltered and Erroneous Data Inputs
    Buggy queries or changes that generate excessive or malformed data can disrupt downstream systems that rely on that data (a simple guard against this is sketched after this list).
  4. Complex Interdependencies
    Modern cloud infrastructure involves many tightly coupled components, meaning failure in one can cascade and amplify effects.
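
As an illustration of guarding against erroneous data inputs (point 3 above), the following Python sketch deduplicates and bounds generated rows before they are published downstream. The row format and the MAX_FEATURES ceiling are assumptions, not details of Cloudflare's pipeline.

```python
# Illustrative sketch only: deduplicating and bounding query output before
# it is published to downstream systems. The row format and the limit are
# assumptions; this shows the general guard, not Cloudflare's pipeline.

MAX_FEATURES = 200  # hypothetical upper bound on rows the downstream consumer can handle


def build_feature_rows(raw_rows: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return a deduplicated, bounded list of (feature, value) rows.

    Exact duplicate rows (like those a buggy query might produce) are
    collapsed, and a hard ceiling rejects output that would break the
    consumer, so bad data fails loudly here rather than in production.
    """
    deduped = list(dict.fromkeys(raw_rows))  # preserves order, drops exact duplicates
    if len(deduped) > MAX_FEATURES:
        raise ValueError(f"{len(deduped)} features exceeds limit of {MAX_FEATURES}")
    return deduped


if __name__ == "__main__":
    rows = [("bot_score", "ml_v5"), ("bot_score", "ml_v5"), ("ja3_hash", "enabled")]
    print(build_feature_rows(rows))  # duplicates removed before the file is written
```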

To mitigate these risks, organizations should:

  • Implement rigorous pre-deployment testing covering edge cases and load conditions.
  • Use real-time monitoring and alerting to detect anomalies early and trigger a swift incident response (a minimal error-rate check is sketched after this list).
  • Enforce hard limits and validation on configuration files and data inputs to avoid overloading systems.
  • Maintain redundancy and rollback mechanisms so faulty changes can be quickly reverted.
  • Conduct periodic audits and code reviews to identify latent risks.
  • Train teams thoroughly on incident management protocols and operational best practices.
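
As a concrete example of the monitoring point above, here is a minimal Python sketch of a sliding-window 5xx error-rate check. The window size and alert threshold are arbitrary assumptions; real alerting pipelines are far more sophisticated.

```python
# Illustrative sketch only: a simple error-rate check of the kind a
# real-time monitoring pipeline might run. The window size and threshold
# are assumptions, not any particular vendor's alerting rules.

from collections import deque


class ErrorRateAlert:
    """Fire when the share of 5xx responses in a sliding window exceeds a threshold."""

    def __init__(self, window: int = 1000, threshold: float = 0.05):
        self.statuses = deque(maxlen=window)
        self.threshold = threshold

    def record(self, status_code: int) -> bool:
        """Record one response; return True if the alert should fire."""
        self.statuses.append(status_code)
        errors = sum(1 for s in self.statuses if s >= 500)
        # Require a minimum sample size before alerting to avoid noisy startups.
        return len(self.statuses) >= 100 and errors / len(self.statuses) > self.threshold


if __name__ == "__main__":
    monitor = ErrorRateAlert()
    # Simulate mostly healthy traffic, then a burst of 500s like an outage onset.
    for code in [200] * 900 + [500] * 100:
        if monitor.record(code):
            print("ALERT: 5xx error rate above threshold")
            break
```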

The November 2025 Cloudflare outage highlights the complexity and fragility behind widely used internet infrastructure. Despite the disruption, the company's prompt identification and correction of the issue minimized the outage duration and impact. Continuous improvement in safeguards and preparedness remains essential in an increasingly connected world.


# news