We Spent Days Debugging the Wrong System (and Accidentally Fixed Minnesota's Internet)
I wasn't on the initial page.
By the time I got involved, the incident had been running for a while and everyone thought they knew what it was. The frontend was broken. Pages weren't loading, some users saw blank screens, static assets failed intermittently. It looked like a bad UI deploy, so that's where everyone went.
Managers got pulled in. A backend Java dev tried to help. Larry was in and out. Engineers would show up, poke around, disappear for a few hours, come back. This went on for a couple days. Nothing changed.
First thing I did was skip the code and open the network tab. Some requests worked. Some didn't. Refresh and it might fix itself.
That didn't sit right. If a bundle is bad, it's bad for everyone. If a deploy is broken, it fails every time. This was flickering.
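If you want the flicker as numbers instead of a feeling, a quick probe loop is enough. A rough Python sketch, with a placeholder asset URL standing in for whatever is actually failing:

```python
# Minimal probe sketch: fetch one asset repeatedly and record the outcome
# plus Cloudflare's cf-ray response header. The URL is a placeholder.
import time
import requests

ASSET_URL = "https://example.com/static/app.js"  # hypothetical asset path

results = []
for _ in range(50):
    try:
        resp = requests.get(ASSET_URL, timeout=5)
        results.append((resp.status_code, resp.headers.get("cf-ray", "none")))
    except requests.RequestException as exc:
        # Connection resets and timeouts count as failures too
        results.append(("error", type(exc).__name__))
    time.sleep(1)

ok = sum(1 for status, _ in results if status == 200)
print(f"{ok}/{len(results)} fetches succeeded")
```

A genuinely broken bundle should score zero out of fifty. A flaky path lands somewhere in between.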
We still went through all the Cloudflare stuff. Cache rules, purges, page rules, edge config. Nothing wrong with any of it.
So you start thinking maybe the bug is just subtle, and you keep looking in the same spot.
Cloudflare sticks a Ray ID on every request. I usually ignore them. This time I grouped the failures by Ray ID.
Different users, different times, same Ray path.
If the UI code were broken, the Ray IDs would have been all over the place. They weren't. Same failures, same route.
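One way to do that grouping, as a sketch rather than exactly what we ran: the cf-ray value ends in an airport-style code for the Cloudflare data center that answered the request, so counting failures by that suffix shows whether they cluster on one path. This assumes you have (status, cf-ray) pairs like the probe loop above collects.

```python
# Group probe results by the edge location encoded in the cf-ray header.
from collections import Counter

def colo(ray_id: str) -> str:
    # "8a1b2c3d4e5f6789-MSP" -> "MSP"; anything without a dash passes through
    return ray_id.rsplit("-", 1)[-1]

failing = [ray for status, ray in results if status != 200]
working = [ray for status, ray in results if status == 200]

print("failing edges:", Counter(colo(r) for r in failing))
print("working edges:", Counter(colo(r) for r in working))
```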
We asked people around the company to load the site. Different states, different ISPs, different machines. Some couldn't reproduce it. Others hit it immediately.
Failures clustered by geography. Then by provider.
A lot of traffic in and out of the Twin Cities goes through a small number of interconnects and carrier hotels. ISPs and backbone providers peer there and hand traffic off to CDNs like Cloudflare. When one transit provider in that path goes bad—packet loss, flaky peering, MTU problems—you don't get a clean outage. You get weird partial failures. Static assets fail, HTML loads, retry works, some users fine, others completely broken.
We traced the paths. Every failure went through Verizon. Not Verizon Wireless customers, but Verizon as a transit leg. One hop between users and Cloudflare.
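For the path comparison, something like this, run from a machine that reproduces the failure and one that doesn't, is roughly all it takes. It assumes traceroute is installed; the target is a placeholder, and the substring you flag depends on how the provider's routers reverse-resolve, so treat it as a starting point.

```python
# Run a traceroute and flag hops whose reverse DNS matches a suspect provider.
# Diffing this output between failing and working vantage points is the point.
import subprocess

TARGET = "example.com"   # placeholder for the site behind Cloudflare
SUSPECT = "verizon"      # adjust to however the transit provider's hops resolve

proc = subprocess.run(
    ["traceroute", TARGET],   # requires traceroute on the machine
    capture_output=True,
    text=True,
    timeout=120,
)

for line in proc.stdout.splitlines():
    flag = "   <-- suspect transit hop" if SUSPECT in line.lower() else ""
    print(line + flag)
```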
Larry Patterson was our CTO, but he was also President, CTO, and Co-Founder of Atomic Data. He knew how traffic moves through Minnesota and had relationships with the people running that infrastructure. We showed him the Ray IDs, the provider correlation, the clean separation between failing and non-failing paths. Atomic Data was already looking at issues on a Verizon transit leg. What we had matched.
They pulled Verizon out of the routing mix.
The incident stopped. No deploy, no rollback, no config change. Site loaded.
Verizon having a problem isn't the interesting part. That happens. The interesting part is how long we spent debugging a system that wasn't broken. Once "frontend bug" became the assumption, that's all anyone could see. As soon as we thought about the delivery path, it was obvious.