On November 2nd, 2023, Cloudflare, a major internet infrastructure and security company that handles roughly 20% of the web’s traffic, experienced a significant outage affecting its control plane and analytics services. The outage lasted from November 2 at 11:44 UTC until November 4 at 04:25 UTC, leaving plenty of people confused and concerned about what was going on.
Today, they published a post-mortem explaining what happened, but since the post is a massive read, I thought it’d be nice to make a shortened version of it for our friends with ADHD.
What Happened?
- Power Failure at Data Center Provider: The incident started when Cloudflare’s data center provider, Flexential, ran into an unexpected problem with its utility power. That kicked off a chain of events that ended up disrupting Cloudflare’s services, taking out their control plane and analytics systems.
- Lack of Communication: Flexential didn’t tell Cloudflare that it had switched to generator power. Cloudflare had no idea the power situation had changed, so it couldn’t step up monitoring or prepare for a possible failure.
- Ground Fault and Transformer Failure: At approximately 11:40 UTC, a ground fault occurred on a power transformer at Flexential’s data center. This fault likely resulted from the prior power issues and maintenance by the utility company, Portland General Electric (PGE).
- Complete Power Failure: Ground faults on high-voltage power lines are serious business, and this one caused every generator at the data center to shut down. With the utility feed and the backup generators both offline, the facility lost power completely.
- Activation of Disaster Recovery: In response to the outage, Cloudflare activated its disaster recovery site in Europe. This site is designed to provide essential control plane services during critical incidents.
- Dependency Challenges: Some services couldn’t be restored promptly because of complex dependencies, particularly on Kafka and ClickHouse. These systems were tightly coupled and had to come back up in the right order, which complicated the recovery (see the sketch after this list).
- Relentless Efforts: Cloudflare’s teams worked tirelessly to restore services, including rebuilding configuration management servers and rebooting thousands of servers. This involved a manual process that took several hours.
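To see why tightly coupled dependencies make recovery slow, here’s a minimal sketch of a startup gate that refuses to boot a service until Kafka and ClickHouse answer. This is purely illustrative, not Cloudflare’s code: the hostnames, ports, and timeouts are made up. The point is that if those dependencies are themselves waiting on something else in the same facility, everything stalls until someone untangles the order by hand.

```python
import socket
import time

# Hypothetical dependencies -- hosts and ports are placeholders, not Cloudflare's.
DEPENDENCIES = {
    "kafka": ("kafka.internal.example", 9092),
    "clickhouse": ("clickhouse.internal.example", 9000),
}


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_dependencies(max_wait: float = 600.0) -> None:
    """Block startup until every dependency answers, or give up after max_wait seconds."""
    deadline = time.monotonic() + max_wait
    pending = dict(DEPENDENCIES)
    while pending and time.monotonic() < deadline:
        for name, (host, port) in list(pending.items()):
            if is_reachable(host, port):
                print(f"{name} is up")
                del pending[name]
        if pending:
            time.sleep(5)  # back off before the next round of checks
    if pending:
        raise RuntimeError(f"gave up waiting for: {', '.join(sorted(pending))}")


if __name__ == "__main__":
    wait_for_dependencies()
    print("all dependencies reachable, starting service")
```

Multiply that by thousands of servers, each gated on services that are also trying to come back, and a reboot turns into a multi-hour ordering puzzle.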
What Will Change
Cloudflare, as a company, always tries to learn from incidents and change how it operates (like it did after Cloudbleed). To avoid another situation like this, they’ve committed to the following changes:
- Reducing Dependency on Core Data Centers: Cloudflare aims to reduce its reliance on core data centers by distributing control plane functions across its network. This strategy enhances resilience by minimizing the impact of data center failures on service availability.
- High Availability and Disaster Recovery Plans: All products and features must adhere to high availability standards and have robust disaster recovery plans. These plans will be rigorously tested to ensure that services can be quickly restored in case of an outage.
- Audit and Chaos Testing: Cloudflare plans to conduct comprehensive audits of its core data centers to make sure they meet the company’s standards for reliability. On top of that, they’ll run rigorous chaos testing to check how well their systems hold up when things break (there’s a toy example of the idea after this list).
- Logging and Analytics Disaster Recovery: A dedicated plan will be put in place to keep log and analytics data intact even if a core facility fails, with the goal of preventing data loss and minimizing disruption.
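For a flavor of what chaos testing means in practice, here’s a toy sketch: deliberately fail one component, let the failure cascade through a dependency graph, and assert the blast radius is what you expected. The service names, the dependency graph, and the invariant are all made up for illustration; they’re not Cloudflare’s architecture or tooling.

```python
import random

# Hypothetical dependency graph -- the names are illustrative, not Cloudflare's.
# Each service lists the components it cannot run without.
HARD_DEPS = {
    "api": {"config-db"},
    "analytics": {"clickhouse", "kafka"},
    "dashboard": {"api"},
}
COMPONENTS = {"config-db", "clickhouse", "kafka"} | set(HARD_DEPS)


def cascade(failed: set[str]) -> set[str]:
    """Return everything that ends up down once failures propagate through hard dependencies."""
    down = set(failed)
    changed = True
    while changed:
        changed = False
        for service, deps in HARD_DEPS.items():
            if service not in down and deps & down:
                down.add(service)
                changed = True
    return down


def chaos_round(rng: random.Random) -> None:
    """Kill one random component and check the blast radius stays within expectations."""
    victim = rng.choice(sorted(COMPONENTS))
    down = cascade({victim})
    # Example invariant: losing the analytics pipeline must never take the API down.
    if victim in {"clickhouse", "kafka"}:
        assert "api" not in down, f"api went down when {victim} failed"
    print(f"killed {victim}; down after cascade: {sorted(down)}")


if __name__ == "__main__":
    rng = random.Random(42)
    for _ in range(5):
        chaos_round(rng)
```

In real life the “kill” would be an actual machine, rack, or facility being taken offline, but the discipline is the same: prove the system degrades the way your disaster recovery plan says it will, before a power outage proves it for you.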
Conclusion
The recent outage taught them a valuable lesson about being prepared for catastrophic failures. It’s a good reminder for companies that, no matter how big you are or how many failover systems you have, unexpected events can still bring everything down. Backups, varied testing, redundancy, and the like go a long way.