How we handled downtime caused by a Cloudflare outage

As a cloud-based SaaS provider, we rely on a robust infrastructure to keep our web application running seamlessly. One key player in that setup is Cloudflare, which manages our DNS, firewall, and CDN, ensuring security and speed.

Recently, we found out firsthand what happens when that layer fails. Cloudflare encountered a routing issue that caused a significant disruption: our users were suddenly unable to access the service, and traffic to our application came to a halt. We acted fast. To restore access, we decided to bypass Cloudflare’s firewall and CDN, and it worked: the app was back online.

But this was just the beginning. As soon as service was restored, the queued HTTP requests flooded our servers all at once. The sudden traffic spike overwhelmed the system and degraded performance. We normally rely on autoscaling to handle such surges, but it would have reacted too slowly to absorb a spike this abrupt.

Instead of waiting, we manually provisioned additional servers to absorb the load. This immediate action stabilized performance, and our users experienced minimal further disruption.
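
For readers facing a similar decision, the sizing question behind manual provisioning can be approximated with simple arithmetic. The sketch below is illustrative only and is not the tooling used during this incident. The function names, the per-server throughput, and all figures are hypothetical assumptions. It estimates how many servers are needed to serve steady incoming traffic while draining a backlog within a chosen time window.

```python
import math


def servers_needed(backlog_requests, arrival_rate_rps, per_server_rps, drain_seconds):
    """Estimate servers required to serve ongoing traffic and clear a backlog.

    All parameters are hypothetical inputs for illustration:
    backlog_requests  - requests that piled up during the outage
    arrival_rate_rps  - steady incoming requests per second
    per_server_rps    - requests per second one server can sustain
    drain_seconds     - how quickly the backlog should be cleared
    """
    if per_server_rps <= 0 or drain_seconds <= 0:
        raise ValueError("per_server_rps and drain_seconds must be positive")
    # Total throughput = steady traffic + backlog spread over the drain window.
    required_rps = arrival_rate_rps + backlog_requests / drain_seconds
    return math.ceil(required_rps / per_server_rps)


def extra_servers(current_servers, backlog_requests, arrival_rate_rps,
                  per_server_rps, drain_seconds):
    """Servers to add on top of the current fleet (never negative)."""
    needed = servers_needed(backlog_requests, arrival_rate_rps,
                            per_server_rps, drain_seconds)
    return max(0, needed - current_servers)


if __name__ == "__main__":
    # Made-up numbers: 60,000 backlogged requests, 500 rps steady traffic,
    # 100 rps per server, and a 2-minute target to clear the backlog.
    print(extra_servers(6, 60_000, 500, 100, 120))
```

With these made-up inputs, the required throughput is 500 + 60,000 / 120 = 1,000 requests per second, or 10 servers, so a fleet of 6 would need 4 more. A real estimate would also need to account for server boot time and uneven load.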

In situations like these, every second matters. While automation is a powerful tool, sometimes manual intervention is the best way to ensure smooth operations when the unexpected happens.