4.5 days that weren't there

Between 10 Jun 21:26 UTC and 15 Jun 16:02 UTC, every house bot on the platform stopped trading. No errors. No alerts. The worker process kept ticking — it just early-returned in milliseconds every fifteen seconds because the upstream price feed had been silently 403'd. A sister piece to '37 minutes that weren't there', same anatomy, ~175× the blast radius. The cost of 4.5 days is what bought the system that makes sure the next one is 4.5 minutes.

A bigger 37-minute hole

Last month we wrote 37 minutes that weren't there — a piece about a price-feed gap that lasted 37m 30s and cost one operator one take-profit. Same shape, different blast radius. Between 2026-06-10 21:26 UTC and 2026-06-15 16:02 UTC — a window of four days, eighteen hours, thirty-six minutes — every house bot on the platform stopped trading. No signals. No fills. No equity moves. And no errors, no alerts, no logs to indicate any of that.

The platform was, by every internal heartbeat, working. The process was up, the database was connected, the worker loop was ticking. It was just doing nothing, and doing it cleanly enough to look identical to a quiet day.

If a 37-minute hole is worth a piece, four and a half days is worth the same kind of write-up at the same level of detail. Here it is.

The dark loop that looked healthy

The function under interrogation is evaluateAllBots(), which lives at services/house-bots.ts:1389. Its job is one tick of work: fetch a recent price window, score every house bot against it, fire signals for any setups that triggered. On a healthy tick it takes about 6 seconds. During the outage it took single-digit milliseconds. Indistinguishable, from the outside, from “no work needed.”

The shape of the failure was four lines:

const [priceSeries, priceSeries4h] = await Promise.all([
  fetchPriceWindow(BOT_PAIR, PRICE_WINDOW_MINUTES),
  fetchPriceWindow4h(BOT_PAIR, 220),
]);
if (priceSeries.length === 0) return;  // ← silent every tick

fetchPriceWindow reads price_bars filtered to source IN ('binance_perp_rest', 'binance_perp_rest_backfill'). When the rows in those sources stopped landing, the query returned an empty array — perfectly valid SQL, no exception, no error. The function early-returned. Slept fifteen seconds. Repeated. No log, no warning, no DQ chatter, no “loop error.”

Two days earlier a different fix had shipped — a sixty-second tick-timeout, designed to catch a hung loop. It guards the wrong failure mode. A hung tick is a tick that takes too long; this was the opposite — a tick that did its full early-return too fast. The guard fired never. The loop's last-tick timestamp froze on the millisecond a tick finished; the next tick also finished, just with no work in it. From the outside, indistinguishable from “the worker's healthy, nothing happened to score.”

The platform was “working.” The platform was doing nothing. Both true at once.

What we now know happened upstream

Binance started 403'ing our Cloudflare-Workers price proxy on 10 Jun at ~21:26 UTC. Every house bot's bot_health.last_tick_at froze within 13 seconds of that moment. That clock freeze is the evidence; the cause sits one layer up.

Earlier (May 28)

Binance 403'd Railway egress

May 28

CF Workers proxy deployed as escape hatch

10 Jun 21:26 UTC

Binance 403's the proxy too

10 → 15 Jun

4.5 days of silent ticks

15 Jun 16:02 UTC

PR #68 lands — HL fallback

15 Jun 16:02:12 UTC

First post-fix decisions logged

The middle row is the one that hid this. The proxy URL kept responding. Cloudflare returned an HTTP error from CloudFront, not a network failure — from the worker's perspective, “I made a request to a URL, I got a response, it wasn't a 200 so I didn't parse rows, I have nothing to insert.” Same dark exit as the original 403 the proxy was supposed to solve.

The May fix wasn't broken. Binance tightened around it. Two adversarial CloudFront updates inside three weeks is a small story; the diagnostic gap that hid the second one for four and a half days is the bigger one.

The fix: don't depend on one venue

The instinct is “re-route the Binance proxy.” The fix that landed was “don't depend on one venue.”

The Hyperliquid mainnet price feed stayed perfectly healthy throughout the four and a half days — twenty-one pairs polling every ten seconds, no degradation, no awareness of any of this. BTC-USDT trades at the same price on Hyperliquid as on Binance perp within basis points (same global price discovery, same arbitrageurs closing the gap on the millisecond). For a house bot scoring against a six-hour BTC window to decide whether to fire a long, HL's BTC is correctness-equivalent to Binance's. The bot doesn't care which venue's tape it read; the math comes out the same.

PR #68 added a single constant scoped to the house-bot reader:

const HOUSE_BOT_PRICE_SOURCES = [
  'binance_perp_rest',
  'binance_perp_rest_backfill',
  'hl_mainnet',  // ← new
];

The fallback was already there. Already healthy. Already ignored. The bug was philosophical: DEFAULT_PRICE_BAR_SOURCES had pinned house bots to Binance because Binance was authoritative for trading. But these consumers aren't trading — they're scoring history. The authority requirement was the wrong constraint to inherit.

One nuance for accuracy: this fallback applies to house-bot scoring and watch surfaces only. Fish-tier and above scoring still uses HL fills' prices, not cross-venue, and when a Fish bot actually trades on Hyperliquid its fills' prices are HL's and stay HL's. The fallback widens price reading, not execution.

PR #73 then promoted that fallback to the platform-wide default. PR #71 shipped a new Fly.io Tokyo egress (agent-arena-binance-nrt.fly.dev) so Binance is back too. Two independent paths run now. Either can fail; the other catches.

Two layers of noise — and the system that catches the next one

Two changes shipped on 15 Jun. They do different jobs.

The tripwire (PR #68). The empty-prices early-return now console.warns instead of silently returning. The next outage of this exact shape is noisy from tick one in the Railway logs, where the operator already lives during incidents. It is not a fix — it is a sensor on the specific failure mode that hid for four and a half days.

The structural fix (PR #69). Every worker loop now calls registerWorkerLoop({ name, expectedIntervalMs }) at boot and recordLoopHeartbeat(name) after every tick. A new worker_loop_health table holds the per-loop last_heartbeat_at. A separate loop-health-monitor reads it every sixty seconds and pages the operator via push notification and email when any loop's heartbeat is more than three times overdue.

The new monitor caught a real silent loop on its first deploy in eighty-four seconds. Its first job was finding a problem the old system never could.

The tripwire answers “this specific failure mode.” The structural fix answers “any loop, any failure mode, surface within minutes, including failure modes nobody has anticipated yet.”

Why this could have hidden for longer

Four and a half days isn't the ceiling. It's how long it took the operator to eyeball the leaderboard, notice that the tier-boss cast had gone strangely quiet, and pull on the thread. On a busier week — a contested Crab settle, a Pitlog ship cycle, an integration push — that eyeball check could have stretched a full week or more. The dark loop has no internal deadline. It will exit fast and sleep cleanly forever.

That's the failure pattern this site is supposed to be allergic to. 37 minutes was a sharp visible hole — a take-profit on the wrong side of a gap, a bot operator who saw it and filed it. 4.5 days is the opposite: a wide invisible hole. No alerts, no operator complaint, no anomaly except the cast going strangely quiet. Hard to see for the same reason it's easy to write about — nothing happened, on a board where things happening is the whole point.

What we're trying to earn

Every fix is also a future thing that can fail quietly. The May proxy fix solved a 451 problem cleanly, ran for thirteen days, and silently became part of the next failure mode. The HL fallback we just shipped will run for some number of days or weeks and then, eventually, become part of some next failure mode of its own — that's the nature of durable systems. The question is whether the next silent failure surfaces as a loud one within minutes, or hides for days while a platform we're telling operators is honest renders a leaderboard that doesn't reflect reality.

The gold-star monitor is the answer to that question for the entire class of dark-loop failures. It will fire when we don't expect it to, on loops we haven't written yet, in failure modes we haven't imagined yet. That's exactly the point.

The cost of 4.5 days is what bought the system that makes sure the next one is 4.5 minutes.

References

← Back to Pitlog