BotPit
incident · 8 May 2026

37 minutes that weren't there

BotPit's price feed dropped for 37 minutes on May 6th. A bot's correct take-profit got rejected. Here's what happened, what we did, and what we changed.

The incident, in three numbers

On 2026-05-06 09:23:34 UTC, our BTC-USDT price feed recorded its last tick. The next tick arrived at 10:01:04 UTC, 37 minutes and 30 seconds later. During that window, every BTC signal that hit our matching engine got rejected with PRICE_FEED_STALE — the engine's standard guard against filling against price data older than 10 seconds.
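The guard itself is simple. A minimal sketch of the idea, with illustrative names and the 10-second threshold from above (not our actual engine code):

```python
STALE_AFTER_S = 10.0  # engine refuses to fill against ticks older than this


def check_fillable(last_tick_ts: float, now: float) -> str:
    """Decide whether a signal arriving at `now` can be filled.

    `last_tick_ts` is the timestamp of the most recent price tick.
    Function and constant names are illustrative, not BotPit's code.
    """
    if now - last_tick_ts > STALE_AFTER_S:
        return "PRICE_FEED_STALE"
    return "FILLABLE"
```

During the May 6th gap, every signal saw a tick age measured in minutes, so every one of them took the `PRICE_FEED_STALE` branch.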

Stuber, our highest-rated bot, was holding a long BTC position and watching for its take-profit at $81,976.47. At 09:40:06 UTC, deep inside the gap, BTC reached that price (confirmed against Binance's archive). The bot fired its CLOSE signal correctly. The matching engine rejected it 390ms later — there was no live tick to fill against.

The bot received a queued response at the API layer (the row had successfully landed in our queue), saw what looked like a successful submission, and cleared its local state. The async rejection was invisible to it. The position stayed open. Stuber spent the next 48 hours holding what should have been a closed winner.
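The failure mode on the bot side can be sketched in a few lines, with hypothetical state shapes (this is the pattern, not Stuber's actual code):

```python
def handle_close_response_buggy(api_status: str, local_state: dict) -> dict:
    """The bug, sketched: the bot treats the API-layer 'queued' ack as
    success and drops its position record, so the engine's later async
    rejection has no local state left to reconcile against.
    """
    if api_status == "queued":             # looks like a successful submission...
        local_state.pop("position", None)  # ...so the bot forgets the position
    return local_state


state = {"position": {"pair": "BTC-USDT", "side": "long"}}
state = handle_close_response_buggy("queued", state)
# The engine then rejects with PRICE_FEED_STALE, but the bot has already
# cleared its memory of the position: it is now orphaned.
```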

What we found in the data

When Stuber's owner flagged it via a GitHub issue, our first read was that the signal had been "stuck queued." It hadn't — queued → rejected happened in 390ms. The 48-hour gap wasn't the queue stalling. It was the bot, having received queued from the API, never learning that the engine had then refused to process the signal.

We pulled price-tick coverage for the surrounding window and confirmed the gap: zero ticks for any pair (BTC, ETH, SOL, PAXG) between 09:23:34 and 10:01:04. Not a Binance-side BTC issue — our own price feed worker had stopped publishing for everything. The matching engine correctly rejected every signal in the window. The engine wasn't broken. The feed was.

We also pulled the 24-hour rejection breakdown across all bots: 6 rows total in this class, all from the same window. No other bot's position was orphaned the same way Stuber's had been — we audited that too, by walking forward from each rejected CLOSE in the window and confirming the position was no longer open. Stuber was the only operator left holding the bag.
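The forward-walk audit reduces to a set check. A sketch with hypothetical row shapes (the real audit ran against our signal and position tables):

```python
def orphaned_closes(rejected_closes: list, open_position_ids: set) -> list:
    """For each CLOSE rejected during the outage window, check whether
    its position is still open afterwards; any that are were orphaned.
    Field names here are illustrative, not our actual schema.
    """
    return [c for c in rejected_closes if c["position_id"] in open_position_ids]


rejected = [{"signal_id": 1, "position_id": "a"},
            {"signal_id": 2, "position_id": "b"}]
# Suppose position "a" was later closed by other means and "b" stayed open:
orphans = orphaned_closes(rejected, open_position_ids={"b"})
```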

The decision: backfill, with witnesses

Three policy answers exist for "the platform's bug materially cost an operator":

  1. "Sorry, T&Cs say platform isn't liable." What most exchanges do most of the time. Cheap. Erodes trust.
  2. "We'll fix it for you, quietly, behind the scenes." Better, but builds a hidden two-tier system: loud operators get fixed; quiet ones don't.
  3. "We'll fix it transparently — script, audit row, public commentary line, the whole chain." What we did.

(3) is the only one that scales when there are 100 operators, because anything else either treats them inconsistently or hides the inconsistency. We picked (3) before we knew which way this particular case would go — that was the framing that made the right call automatic when the data came in.

The data: Stuber's TP hit during the gap, externally verifiable against Binance's public-API klines. A backfill at the TP price ($81,976.47, minus standard 2bp slippage and 2bp taker fee — same cost structure as every other fill on the platform, no special carve-out) settled the trade as if the feed had been live. Realised PnL: +$7,030.59 gross, +$6,918.23 net. The bot moved from holding a -$9,051 unrealised loss to a +$6,918 realised win — a $16k swing — and the reason for the swing is fully auditable.
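The cost structure of the backfill is the platform's standard one and easy to verify. A sketch of the arithmetic, where the entry price and position size are hypothetical placeholders (the real values aren't in this write-up), so the outputs below won't match Stuber's figures:

```python
BP = 1e-4  # one basis point


def backfill_close_long(tp_price: float, entry_price: float, size_btc: float,
                        slippage_bp: float = 2.0, taker_fee_bp: float = 2.0):
    """Settle a long close at the take-profit price under the standard
    2bp slippage + 2bp taker fee. Returns (gross_pnl, net_pnl).
    """
    fill_price = tp_price * (1 - slippage_bp * BP)   # slippage works against the closer
    gross = (fill_price - entry_price) * size_btc
    fee = fill_price * size_btc * taker_fee_bp * BP  # taker fee on exit notional
    return gross, gross - fee


# Hypothetical entry and size, for illustration only:
gross, net = backfill_close_long(tp_price=81_976.47,
                                 entry_price=80_000.0, size_btc=1.0)
```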

The audit trail isn't shy about this. The original rejected/PRICE_FEED_STALE signal row is preserved untouched as the record of what the engine actually did. The backfill writes a new fill, a new equity snapshot, a new admin audit row pointing at this issue, and a venue-voice commentary line on the leaderboard's commentary feed. All five artifacts are tied together by the original signal's ID, so anyone reading the database can reconstruct the chain.

What we changed (so it doesn't happen the same way next time)

Auto-retry on transient rejections. The matching engine now treats PRICE_FEED_STALE as a soft reject — up to five retries with a 2-second cooldown before giving up. Doesn't help with a 37-minute outage (retries during a long gap just hit the same dead feed) but does fix the much more common "100ms-window unlucky timing" case where the bot was right and the engine was a beat behind.
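The retry policy can be sketched as a small loop; `try_fill` stands in for one matching attempt and `sleep` is injectable so the sketch is testable (names are illustrative, not the engine's internals):

```python
MAX_RETRIES = 5
COOLDOWN_S = 2.0


def process_with_retry(try_fill, sleep=lambda s: None):
    """Treat PRICE_FEED_STALE as a soft reject: one initial attempt plus
    up to five retries, with a cooldown between attempts, before a hard
    rejection is recorded.
    """
    for attempt in range(1 + MAX_RETRIES):
        outcome = try_fill()
        if outcome != "PRICE_FEED_STALE":
            return outcome
        if attempt < MAX_RETRIES:
            sleep(COOLDOWN_S)
    return "rejected:PRICE_FEED_STALE"
```

A feed that recovers within the ~10-second retry window gets the fill; a dead feed exhausts all six attempts and the signal is rejected as before.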

Sync POST + GET endpoint. The signal API used to return queued immediately after insertion — the bot had no way to learn whether the engine subsequently rejected the signal without polling. Now the API blocks up to 5 seconds for the engine to reach a terminal state, returning the actual outcome 95% of the time. If the engine doesn't catch up in 5s, the response carries in_flight and the bot can poll GET /api/v1/signals/:id for the terminal state. Closes the structural visibility gap that caused the orphan.
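From the bot's side, the pattern is "blocking submit, then poll if needed." A client-side sketch with hypothetical callables — `post()` performs the blocking POST and `get()` polls `GET /api/v1/signals/:id`, both returning a status string, with `filled` and `rejected` assumed terminal:

```python
import time


def submit_and_resolve(post, get, poll_interval=0.5, timeout=30.0):
    """Submit a signal and resolve it to a terminal state.

    The POST blocks server-side for up to 5s; if the response is still
    'in_flight', fall back to polling GET until a terminal state or the
    client-side timeout.
    """
    status = post()
    deadline = time.monotonic() + timeout
    while status == "in_flight" and time.monotonic() < deadline:
        time.sleep(poll_interval)
        status = get()
    return status
```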

Soft-cancel. DELETE /api/v1/signals/:id lets a bot abandon an in-flight signal cleanly — useful when the price has drifted during a retry window and the original intent is no longer valid. Idempotent, race-safe, returns the terminal state in every case so the bot doesn't need a follow-up GET.
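The idempotency contract is the interesting part: cancelling twice, or cancelling a signal that already reached a terminal state, must be safe. A server-side sketch with illustrative state names:

```python
TERMINAL = {"filled", "rejected", "cancelled"}


def soft_cancel(signal: dict) -> dict:
    """Handle DELETE on a signal and return its terminal state.

    Already-terminal signals are returned unchanged (idempotent, and
    race-safe against the engine winning the race to fill); anything
    still in flight is moved to 'cancelled'.
    """
    if signal["status"] in TERMINAL:
        return signal
    return {**signal, "status": "cancelled"}
```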

Bot-side patch. Stuber's owner shipped a symmetrical fix on the bot side: don't clear local state when a CLOSE is emitted; clear it when the position is observed flat. Plus orphan recovery — if a position exists but the bot has no memory of it (post-restart, post-rejected-CLOSE), install a conservative fallback stop so the position is never silently un-watched. The pattern is now documented in the bot starter repo as a drop-in for any author tracking position state locally.
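Both halves of that patch fit in one reconcile step. A sketch of the pattern with illustrative field names and an assumed 2% fallback stop (the starter-repo version will differ in detail):

```python
def reconcile(local_state: dict, platform_positions: dict) -> dict:
    """Reconcile local bot state against what the platform reports.

    State is cleared only when the platform shows the position flat; a
    platform position the bot has no memory of gets a conservative
    fallback stop installed instead of being silently un-watched.
    """
    pair = "BTC-USDT"
    on_platform = platform_positions.get(pair)
    if on_platform is None:
        local_state.pop(pair, None)          # observed flat -> safe to clear
    elif pair not in local_state:
        entry = on_platform["entry_price"]
        local_state[pair] = {                # orphan recovery
            "recovered": True,
            "fallback_stop": entry * 0.98,   # conservative 2% stop (assumed)
        }
    return local_state
```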

The outage itself remains the next problem. Auto-retry and bot-side state hardening don't make a 37-minute feed gap fillable — they make sure the failure is loud and recoverable. Feed redundancy (a Bybit failover when Binance's WS drops) and outage alerting (page on no-ticks-for-N-seconds) are tracked as a separate engineering surface. We'd rather know within 2 minutes of an outage than 48 hours after, regardless of how good the recovery layer gets.
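The alerting half of that is a one-line watchdog condition. A sketch, where the 120-second threshold is an assumption matching the "know within 2 minutes" goal, not the shipped alerting config:

```python
def should_page(last_tick_ts: float, now: float, max_gap_s: float = 120.0) -> bool:
    """Page the on-call when no tick has arrived for max_gap_s seconds."""
    return (now - last_tick_ts) > max_gap_s
```

Run against a wall clock on a periodic check, this would have fired roughly 35 minutes into the May 6th gap instead of 48 hours after it.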

The sentence we're trying to earn

"BotPit makes mistakes honestly."

That's the sentence. It's not a marketing claim — it's a policy that has to be redemonstrated every time something breaks. This time the cost was the $6,918 credited to a single operator's ledger, plus the engineer-hours behind the platform fix. Next time it'll cost something else. The point is that the bill gets paid in the open, by the platform, with the operator made whole and the audit trail intact.

That's the deal we want operators and copy-traders to know they're getting before they put real capital behind a bot we ranked.
