The Bug That Traveled at the Speed of Light

Chad Linden · 3 min read

I didn't volunteer to work on this problem.

By the time it reached me, it had already been open for weeks. Everyone knew about it. Everyone had looked at it. It was quietly burning time and credibility.

A customer in Germany used our scheduling system for delivery drivers. Every Friday at exactly 5:00 PM, they published open shifts. The moment those shifts went live, hundreds—sometimes thousands—of drivers tried to claim them at the same time. And the system broke its own guarantees.

Five open shifts would turn into eight claimed. Or ten. Multiple drivers would successfully "take" the same shift. From the customer's perspective, this was a disaster. From ours, it was worse: the system believed it was behaving correctly.

Mutexes were added. Guards were layered in. The code read like what you write when you think you're dealing with a race condition. Still, the bug persisted. It only happened under extreme concurrency, only for this customer, and only when traffic crossed the Atlantic.

I started with raw logs. Timestamps down to the millisecond. Which server handled the request. When state was read. When it was written.

On paper, the mutexes should have worked. The critical section was guarded. Only one request should have been able to claim a shift at a time. But the application was running on multiple instances. The database was Aurora MySQL. Mutexes don't cross process boundaries, let alone servers.

That didn't explain everything, but it explained why the code felt reassuring without actually being safe.

The failures weren't random. They clustered tightly around the publish moment. Requests arrived separated by milliseconds. Sometimes less. They all originated in Germany and hit servers in us-east.

This wasn't threads colliding in memory. This was distance.

Light in fiber travels at roughly 200,000 kilometers per second. Even in a perfect world, a Europe-to-US round trip has a lower bound of about 56 milliseconds. In the real world, Frankfurt to AWS us-east-1 is closer to 90–100 milliseconds.
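The arithmetic is easy to sanity-check. Assuming a one-way fiber path of roughly 5,600 km, about the great-circle distance from western Europe to the US east coast, and knowing that real routes are longer and less direct:

```php
<?php
// Back-of-the-envelope latency floor. The 5,600 km figure is an
// assumption, not a measured route; actual fiber paths are longer.
$oneWayKm   = 5600;    // ~western Europe to the US east coast
$fiberSpeed = 200000;  // km/s, roughly two-thirds of c
$rttSeconds = 2 * $oneWayKm / $fiberSpeed;
echo round($rttSeconds * 1000) . " ms\n";  // => 56 ms, before any processing
```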

During that window, dozens or hundreds of requests can already be in flight. All of them can read the same state. All of them can make the same decision. None of them have seen the others yet. Each request is locally correct. Collectively, they are wrong.

Aurora behaved exactly as designed. Isolation guarantees were honored. Transactions were consistent. The bug lived above the database. No mutex in PHP memory can close a 95-millisecond window across the Atlantic.

I traced the full read–modify–write path. I reviewed transaction boundaries. I checked isolation levels. I followed retries. The conclusion never changed. The system allowed decisions to be made on state that was valid when read and invalid by the time it mattered.
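Stripped to its shape, the claim path was a textbook lost update. The sketch below is a reconstruction, not the production code; the table and column names are invented.

```php
<?php
// Hypothetical reconstruction of the racy claim path (schema names invented).
$pdo = new PDO('mysql:host=aurora-endpoint;dbname=scheduling', 'user', 'secret');
$shiftId = 42;

$pdo->beginTransaction();

// 1. Read the current availability.
$stmt = $pdo->prepare('SELECT instances FROM shifts WHERE id = ?');
$stmt->execute([$shiftId]);
$instances = (int) $stmt->fetchColumn();

// 2. Decide based on that snapshot.
if ($instances > 0) {
    // 3. Write back the decremented value. Every request that read the
    //    same snapshot makes the same decision and writes the same result.
    $pdo->prepare('UPDATE shifts SET instances = ? WHERE id = ?')
        ->execute([$instances - 1, $shiftId]);
    $pdo->commit();
} else {
    $pdo->rollBack();
}
```

Nothing in that transaction violates isolation. Each request reads a consistent snapshot and writes a valid row; the race lives in the gap between the read and the write.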

Germany                    US-East
|-------------------------|-----------------------------------------------------------------|
|--- Request A [---------->(Read=5)-->(Write=4)-->(Response=`shift.instances=4`)------------|
|-------- Request B --------->(Read=5)-->(Write=4)-->(Response=`shift.instances=4`)---------|
|------------ Request C ------------>(Read=5)-->(Write=4)-->(Response=`shift.instances=4`)--|
|-------------- Request D ----------->(Read=5)-->(Write=4)]-->(Response=`shift.instances=4`)|
|-------------------------|-----------------------------------------------------------------|
|-------------[ latency+processing Request A,B,C,D        ]---------------------------------|
|-------------------------[ Read=5                        ]---------------------------------|
|---------------------------------------------------------[Read=4                          ]|
|-------------------------|-----------------------------------------------------------------|

I walked someone else through it. Slowly. Step by step. They went quiet. Then said: "…oh."

We moved each shift's remaining instance count out of MySQL and into a single, global Redis instance. The new flow: read availability from Redis, optimistically decrement, attempt the database writes, undo if anything fails, commit if everything succeeds. For each remaining instance, exactly one request wins. The rest see the updated value immediately and fail cleanly.
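In sketch form, assuming the phpredis extension and an invented key scheme ($pdo and $shiftId as in the earlier sketch), the flow looks roughly like this:

```php
<?php
// Sketch of the Redis-first claim flow (phpredis assumed; key name invented).
$redis = new Redis();
$redis->connect('redis.internal', 6379);

$key = "shift:{$shiftId}:remaining";

// DECR is atomic on the single Redis instance: at most one request
// per remaining instance ever sees a non-negative result.
$remaining = $redis->decr($key);

if ($remaining < 0) {
    // Lost the race: restore the count and fail cleanly.
    $redis->incr($key);
    throw new RuntimeException('Shift is no longer available.');
}

try {
    // Won the race: perform the MySQL writes as before.
    $pdo->beginTransaction();
    // ... insert the claim, update the shift row ...
    $pdo->commit();
} catch (Throwable $e) {
    // Undo both sides if anything fails.
    $pdo->rollBack();
    $redis->incr($key);
    throw $e;
}
```

The decrement doubles as the read: by the time a request knows the count, it has either claimed a slot or learned that none is left, no matter how many other requests are still in flight over the Atlantic.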

The mutexes were removed. They weren't just useless—they were misleading.

We deployed. Friday came. 5:00 PM arrived. Nothing broke.