# Backend finding — blank-extension-page (2026-05-11)
## Reproduced (from backend side)?

No — the backend is not implicated. The Go API has been
deployed on the same commit since 2026-04-01 with a sub-0.002 %
error rate, ~1M req/day of 2xx, and zero correlated spike at or
after 2026-04-09. The SW boot path has no listener-after-await bug.
The hang lives in the renderer (toby.html tab) — almost certainly
Chrome's "extension context invalidated" state dropping
chrome.storage.local.get callbacks, which is a Chromium MV3
platform behaviour, not anything we ship on the Go side.
## Symptoms observed in logs / SW / API
- Prod-api is stable. `prod-api-00427-9p2` (revision commit SHA `4b0107858e706c904e6cf2841fbcbf81a1e2f94f`) has been the active prod revision since 2026-04-01T20:08Z. The two before it (00425, 00426) ran the same SHA, so the last actual code change to deployed prod-api is older than 2026-02-02 — well before the user-complaint window the frontend doctor narrowed to post-2026-04-09.
- Error volume is rounding error. Last 24 h: 0 5xx in prod metrics (Cloud Run aggregated `request_count`, `response_code_class=5xx`); 23 ERROR-severity entries in stderr, of which 22 are expected 401s on stale-session `PUT /v2/users/activity` / `GET /v2/states` and 1 is a downstream `toby-ai-api` 500 on a cache-warm `UpdateUserAICache` call — none are on the user-hydration path.
- 503 noise is normal cold-edge churn, not a regression. 51 × Cloud-Run "malformed response or connection error" 503s in the last 36 h are spread across `/v2/states`, `/v2/surveys`, `/v2/users`, `/v2/public/announcements` — all post-hydration endpoints. They cannot cause the pre-hydration blank-page hang.
- No panics, fatals, or CRITICALs in the prod-api log stream since 2026-04-01.
- DB is healthy. `users` (Toby Prod, read-only) shows 41,578 last-active in the past 24 h, 59,654 in the past 7 d, and 720 new signups in the past 7 d — i.e. the API is successfully serving the hot user-activity path right now.
- SW boot path is well-formed. Every `chrome.*.addListener` call in `apps/extension/app/background/*.ts` registers synchronously at module top level (or synchronously inside a non-awaiting `init()`). I could not find a listener-after-await MV3 boot-bug pattern.
- Min-instances=4 on prod-api → cold-start hangs are not a factor for the request volumes in play.
## Root cause (best hypothesis, backend lens)
Confidence: high that the proximate fault is not on the backend. The chrome.storage callback drop the frontend doctor identified is a Chromium MV3 renderer-side phenomenon ("extension context invalidated") that occurs in the toby.html tab, not in the SW or our API:
- Chrome auto-updates the extension while the user has a long-lived new-tab open (or the user reloads the extension themselves). Chrome marks the old tab's extension context invalidated; subsequent calls to `chrome.storage.local.get` from that tab return without invoking the callback (`lastError` is set, but `getUser()` at `apps/extension/app/state/accessors/user.tsx:45-50` never reads `chrome.runtime.lastError`, so the dropped callback is indistinguishable from "just slow").
- The frontend's `d68726b29` commit (2026-04-09) widened the `AuthWrapper` gate at `apps/extension/app/containers/Toby.tsx:304` to depend on `isUserHydrated`, which is bound 1:1 to that one unbounded storage callback. The same chrome.storage drop now reliably produces a blank page where it used to either work or only flicker.
- The Go API is structurally incapable of causing this drop: `getUser()` does not make a network call. It reads from chrome.storage. The very first network request to `prod-api` would happen only after `isUserHydrated` flips true. The hang precedes any HTTP request.
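The hang mechanics can be sketched in code. This is a hypothetical reconstruction of the pattern described above — the real `getUser()` lives at `apps/extension/app/state/accessors/user.tsx:45-50`; the function name and shape here are illustrative, not the shipped code:

```typescript
// Hypothetical reconstruction — NOT the shipped code. Shows why a dropped
// chrome.storage callback is indistinguishable from a slow one.
declare const chrome: any; // ambient MV3 extension API

export function getUserSketch(): Promise<unknown> {
  return new Promise((resolve) => {
    chrome.storage.local.get('user', (items: any) => {
      // chrome.runtime.lastError is never read and there is no timeout:
      // if Chrome drops the callback ("extension context invalidated"),
      // this promise stays pending forever and isUserHydrated never flips.
      resolve(items.user);
    });
  });
}
```

With no timeout branch, the only observable difference between "context invalidated" and "IDB is slow today" is that the former never completes — exactly the blank page the frontend doctor described.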
What the backend doctor can refute, with evidence:
- Not an API regression — code is from 2026-02-02 or earlier (commit SHA `4b0107858` has been deployed on three consecutive revisions back to revision 00425, 2026-02-02; the deploys since are config-only redeploys of the same SHA).
- Not a 5xx storm — 0 responses of 500 or any 5xx class in the last 24 h (`response_code_class="5xx"`, hourly time-series, all zero).
- Not a CORS / cookie / CSP change — no commits to `apps/api/middlewares/`, `apps/api/routes/`, or any session-handling code in 2026.
- Not a database collapse — 41k DAU, healthy hourly distribution, no row-count anomaly.
- Not a Cloud-Run cold-start race — min-instances=4 keeps the service permanently warm; the 503 "malformed response" blips are tiny (51 / 36 h) and don't touch the auth path.
## Evidence
### GCP logs — error volume since 2026-05-04

Filter (`gcp-observability list_log_entries`):

```
resource.type="cloud_run_revision"
AND resource.labels.service_name="prod-api"
AND severity>=ERROR
AND timestamp>="2026-05-04T00:00:00Z"
```
Distinct error messages by count, all 23 entries:

```
8  "401 PUT /v2/users/activity"
8  "401 GET /v2/states"
5  "failed to search nextItems"
1  "failed to send AI request"   # downstream toby-ai-api 500
1  "401 POST /v3/nexts/search"
```
All 401s. No panics. The one 500 is downstream toby-ai-api from
services/ai/ai_service.go:171 UpdateUserCache — not on the auth
or hydration path.
### GCP metrics — 5xx by day (run.googleapis.com/request_count)

```
2026-05-10 → 2026-05-11   500-class: 0    2xx: 1,108,508
2026-05-09 → 2026-05-10   500-class: 4    2xx:   560,996
2026-05-08 → 2026-05-09   500-class: 0    2xx:   718,076
2026-05-07 → 2026-05-08   500-class: 19   2xx: 1,186,238
2026-05-06 → 2026-05-07   500-class: 8    2xx: 1,265,714
```
Worst day = 19 / 1.18 M ≈ 0.0016 %. No spike pattern.
### Cloud-Run revisions (gcloud run revisions list --service=prod-api)

```
prod-api-00427-9p2   2026-04-01T20:08Z   commit 4b01078   ACTIVE
prod-api-00426-s6l   2026-04-01T17:58Z   commit 4b01078
prod-api-00425-x7f   2026-02-02T19:13Z   commit 4b01078
prod-api-00424-dpx   2026-01-28T20:16Z   commit 0b44adc
```
The deployed SHA hasn't actually changed since at least 2026-02-02. The 2026-04-01 deploys are config redeploys of the same SHA. Code on the wire today is older than the bug window.
### Cloud-Run config (gcloud run services describe prod-api)

```yaml
autoscaling.knative.dev/minScale: "4"
autoscaling.knative.dev/maxScale: "20"
containerConcurrency: 900
run.googleapis.com/cpu-throttling: "false"
run.googleapis.com/startup-cpu-boost: "true"
```
Always 4 warm instances; cold-start cannot explain a recurring user-visible issue.
### DB query — Toby Prod (Read Only) be55a66b-c905-4759-9ce1-a97785bb69e6

```sql
SELECT
  COUNT(*) AS total_users,
  COUNT(*) FILTER (WHERE created_at >= NOW() - INTERVAL '7 days') AS new_7d,
  COUNT(*) FILTER (WHERE created_at >= NOW() - INTERVAL '1 day') AS new_1d,
  COUNT(*) FILTER (WHERE last_active >= NOW() - INTERVAL '1 day') AS active_1d,
  COUNT(*) FILTER (WHERE last_active >= NOW() - INTERVAL '7 days') AS active_7d
FROM users WHERE deleted_at IS NULL;
```

```
total_users   new_7d   new_1d   active_1d   active_7d
  1,046,842      720       93      41,578      59,654
```
And hourly distribution (last 12 h):

```
2026-05-11 17:00Z    943
2026-05-11 16:00Z  2,427
2026-05-11 15:00Z  2,740
2026-05-11 14:00Z  3,015
2026-05-11 13:00Z  3,352  ← peak
2026-05-11 12:00Z  2,877
2026-05-11 11:00Z  2,072
2026-05-11 10:00Z  1,418
2026-05-11 09:00Z  1,443
2026-05-11 08:00Z  1,736
2026-05-11 07:00Z  2,018
2026-05-11 06:00Z  1,853
```
Healthy diurnal curve. No collapse, no flatline. The API is processing this traffic right now.
### SW source — apps/extension/entrypoints/background.ts

```typescript
import { defineBackground } from 'wxt/utils/define-background';
import { persistQueryClientRestore } from '@tanstack/react-query-persist-client';
import { IDBpersister, queryClient } from '~/state/client';

// These register chrome listeners as side effects
import '~/background/contextMenus';
import '~/background/inject';
import '~/background/badge';
import '~/background/onInstall';
import '~/background/tobyLinks';
import '~/background/omnibox';

export default defineBackground(() => {
  persistQueryClientRestore({
    queryClient,
    persister: IDBpersister,
    maxAge: Infinity,
    hydrateOptions: { defaultOptions: { queries: { gcTime: Infinity } } },
  });
});
```
The side-effect imports happen before defineBackground is
called, so they register listeners synchronously at module
evaluation. No MV3 boot-bug.
### SW listeners (all top-level / no await before addListener)

- `apps/extension/app/background/onInstall.ts:9` — `chrome.runtime.onInstalled.addListener` (sync, top-level).
- `apps/extension/app/background/inject.ts:140,204` — `chrome.tabs.onUpdated.addListener`, `chrome.runtime.onMessage.addListener` (both sync, after only `const`s).
- `apps/extension/app/background/badge.ts:4-5` — `chrome.action.setBadge*` (sync, top-level; no listeners).
- `apps/extension/app/background/omnibox.ts:1` — `chrome.omnibox.onInputEntered.addListener` (sync, top-level).
- `apps/extension/app/background/tobyLinks.ts:5-81` — `chrome.webRequest.onBeforeRequest.addListener` × 2, `chrome.tabs.onUpdated.addListener` (sync, all inside an `if (isFireFox())` gate; Chrome users skip this entirely).
- `apps/extension/app/background/contextMenus.ts:217-261` — declared `async function init()` but no `await` anywhere before the three `.addListener` calls on lines 218, 219, 231. Listener attachment is therefore synchronous. ✓
The pattern check confirms what the prod metrics already imply: even when SWs do die and restart, the listeners re-attach correctly. The MV3 boot path isn't the wound.
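For reference, the listener-after-await anti-pattern the audit was checking for can be sketched as follows. This is a generic illustration, not Toby code: MV3 tears the SW down when idle, and a listener attached only after an `await` can miss the very event that woke the worker.

```typescript
// Generic illustration of the MV3 listener-registration rule — not Toby code.
declare const chrome: any; // ambient MV3 extension API

// BAD: any await before addListener. On SW cold start, the waking event
// can be dispatched before the listener exists, and is then silently lost.
export async function badRegister(): Promise<void> {
  await Promise.resolve(); // stand-in for e.g. awaiting storage or IDB
  chrome.runtime.onMessage.addListener(() => {});
}

// GOOD: attach on the first synchronous turn of module evaluation;
// do any async warm-up after the listener is already in place.
export function goodRegister(): void {
  chrome.runtime.onMessage.addListener(() => {});
}
```

Every Toby SW module follows the `goodRegister` shape, which is why the boot path is cleared.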
## SW shortcomings I did find (defence-in-depth, not the root cause)

- `apps/extension/app/background/contextMenus.ts:145-163` — fetches `${env.API_URL_V3}/cards` from the SW with no `AbortController` and no timeout. If the network stalls during a context-menu save and the SW idles past 30 s, the in-flight fetch keeps the SW alive until TCP resets — not a crash, but it means SW lifetimes are jittery on bad networks.
- `apps/extension/entrypoints/background.ts:14-19` — `persistQueryClientRestore(...)` inside `defineBackground` is fire-and-forget with no `.catch()`. If IDB throws on cold open, it's silently swallowed; the queryClient stays empty and any SW callsite reading from it gets stale-empty data. This is the same IDB-availability surface the frontend doctor flagged for `useIsRestoring()` — except in the SW it's even less visible, because the SW has no UI.
- `apps/extension/app/state/client.ts:24-29` — `IDBpersister.restoreClient` calls `get(idbValidKey)` from `idb-keyval` with no error path. If IDB is locked or quota-exceeded, the promise rejects deep inside `persistQueryClientRestore`. Same surface as above.
None of those cause the blank page. They explain why the "silent" failure mode is so silent.
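The fetch-timeout gap in the first bullet could be closed with the standard `AbortController` pattern. A generic sketch — the function name and the 10 s default are assumptions, not existing code:

```typescript
// Generic fetch-with-timeout sketch — illustrative, not the shipped
// contextMenus.ts code. Caps how long a stuck socket can pin the SW alive.
export async function fetchWithTimeout(
  url: string,
  init: RequestInit = {},
  timeoutMs = 10_000, // assumed default, per the hardening proposal below
): Promise<Response> {
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), timeoutMs);
  try {
    return await fetch(url, { ...init, signal: ctrl.signal });
  } finally {
    clearTimeout(timer); // don't leave a stray timer keeping the SW awake
  }
}
```

The `finally` matters in an SW: clearing the timer on both success and abort avoids replacing one lifetime-extender (the socket) with another (the pending timeout).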
## Recent commits — Go API since 2026-01-01 (everything)

```
b9bea18c Post fallback Slack message when CWS review AI draft fails (#10)
4666eefc Merge pull request #4 (CWS review monitor)
cbc92a78 Make retention discount eligible for all cancellation reasons
ba247d9a Add Chrome Web Store review monitor with AI-drafted responses
0727449d Migrate landing page CI and remove dead workflows
5bd96126 Add Turborepo configuration
7effe8b6 Refine legacy user discount logic in subscription context (#876)
81f1ab57 Refactor retention offer logic (#875)
a462d067 Enhance user email verification logic … token validation (#874)
36fe2451 Refactor retention offer logic to utilize Stripe coupon (#872)
52f51924 Enhance retention offer logic and legacy user support (#871)
bd6f64a8 Add retention discount functionality (#869)
9fb43ddd Fix build error from last commit (#870)
7c7a1cdf Remove Legacy Discount from getCoupon (#868)
```
None of these touch apps/api/middlewares, apps/api/routes,
session/cookie handling, or anything the extension hits on boot.
a462d067 is the only one in user-verification territory but it
predates the latest prod deploy (2026-04-01) and is on a Stripe
verify path, not the SW-relevant chrome.storage seed.
## Proposed fix (backend / SW side)
There is no backend-side fix that resolves the user-visible bug, because the bug is not on the backend. The frontend doctor's Layer 1 (timeout the hydration promise) and Layer 2 (recovery UI) are the correct, sufficient fixes.
That said, three SW-side hardening items are worth shipping alongside the FE fix, because they remove ambiguity from future incidents of this shape:
- Catch and log the persist-restore failure in the SW. `apps/extension/entrypoints/background.ts:14`:

  ```diff
   export default defineBackground(() => {
     persistQueryClientRestore({
       queryClient,
       persister: IDBpersister,
       maxAge: Infinity,
       hydrateOptions: { defaultOptions: { queries: { gcTime: Infinity } } },
  -  });
  +  }).catch((err) => {
  +    console.error('[toby-sw] persistQueryClientRestore failed', err);
  +  });
   });
  ```

  Zero functional change; just makes IDB failures visible in chrome://extensions → service worker → console.

- Bound the SW `fetch(/v3/cards)` and `fetch(/lists)` paths with an `AbortController` (`apps/extension/app/background/contextMenus.ts:145-191`). A 10 s timeout + abort. Prevents the SW from being kept alive by a stuck TCP socket on bad networks.

- Promisify the SW's `chrome.storage` reads with a `lastError` check at the source. Build a single `chromeStorageGet<T>(keys, { timeoutMs })` helper that:
  - reads `chrome.runtime.lastError` on every callback,
  - rejects on a timeout,
  - rejects if `chrome.runtime.id` is undefined (the canonical signal that the extension context is invalidated).

  Replace every raw `chrome.storage.local.get(key, cb)` in `app/state/accessors/*.tsx` and `app/utils/chromeapi.ts:248` with that helper. This is the real defence — the FE's Layer 1 only handles the `getUser` site; this helper handles every site uniformly. (It belongs in the FE codebase; from the backend lens, I'm flagging it because the chrome.storage drop is a class of bug, not a one-off.)
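A minimal sketch of what that helper could look like. The name `chromeStorageGet` and the `timeoutMs` option come from the proposal above; everything else is an assumed illustration, not shipped code:

```typescript
// Illustrative sketch of the proposed helper — assumptions, not shipped code.
declare const chrome: any; // ambient MV3 extension API

export function chromeStorageGet<T>(
  keys: string | string[],
  { timeoutMs = 5_000 }: { timeoutMs?: number } = {},
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    // Canonical "extension context invalidated" signal: runtime.id is gone.
    if (chrome?.runtime?.id === undefined) {
      reject(new Error('extension context invalidated'));
      return;
    }
    const timer = setTimeout(
      () => reject(new Error(`chrome.storage.local.get timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
    chrome.storage.local.get(keys, (items: any) => {
      clearTimeout(timer);
      // lastError must be read inside the callback; reading it here is
      // what turns a silent callback drop into an observable rejection.
      const err = chrome.runtime.lastError;
      if (err) reject(new Error(err.message ?? 'chrome.storage error'));
      else resolve(items as T);
    });
  });
}
```

Callers then get a rejection (timeout or invalidated-context) instead of a promise that hangs forever, which is exactly the failure mode behind the blank page.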
I would NOT redeploy prod-api as part of this incident — the API code is innocent and a redeploy is a needless blast-radius increase.
## Verify plan
- Confirm prod-api stability over the next 24 h. Re-run the 5xx and ERROR-severity log queries above; expect the same near-zero counts. No action if so.

- After FE Layer 1 ships, watch for the new `NewTabHangShown` beacon the frontend doctor recommended (Layer 3). If it fires at non-trivial volume without a corresponding 5xx spike in prod-api, that confirms the chrome.storage drop is platform-side and we ship Layer 1+2 as-is.

- If the beacon volume correlates with a Cloud-Run revision bounce (use `gcloud run revisions list --service=prod-api` to spot any 5xx jump on a new revision), re-open the backend investigation. With the current cadence (one prod-api code change in 4 months) this is unlikely.

- Add a synthetic SW health probe: a Cloud Scheduler job that hits `https://api2.gettoby.com/v2/states` with a throwaway token every 5 min and alerts if 5xx > 2 % over a 15 min window. Right now we have no signal between "user complains in CWS review" and "Sentry/Amplitude burn". Out of scope for this incident; file as follow-up.

- Optional repro for SW context invalidation. In a dev profile: open `chrome-extension://<id>/toby.html`, then in chrome://extensions toggle the extension off and back on. The open tab should now have `chrome.runtime.id === undefined`, and any `chrome.storage.local.get` from it returns without invoking the callback. This is the canonical scenario; if FE Layer 1 rescues the page after 5 s in this state, the fix is verified.
## Confirm or refute frontend hypothesis
Refine, then confirm. The frontend doctor wrote:
> If the MV3 service worker is dead or the extension context is invalidated … `chrome.storage.local.get` can return without invoking its callback.
The "MV3 service worker is dead" half is unlikely to be the
cause in isolation. chrome.storage.local.get reads are handled
by the Chrome browser process, not by the extension SW — a dead
SW alone does not drop callbacks; Chrome will quietly fulfill the
storage read either way. What does drop callbacks is the
renderer (the toby.html tab) being in the
"extension context invalidated" state. That state is most
commonly entered when:
- Chrome auto-updates the extension while a chrome_url_override new-tab is open (Toby owns the new-tab page → this happens to every Toby user, every time we ship an update);
- the user manually disables/re-enables or reloads the extension;
- the SW crashes during a critical phase that invalidates the handshake (rare, but possible — and the SW IS more crash-prone than it should be, because of the unhandled `persistQueryClientRestore` rejection and the un-aborted fetches I flagged above).
So the chrome.storage callback drop is real, the FE-side fix is correct and sufficient, and the SW-hardening items in §"Proposed fix" reduce the frequency of the drop (by reducing renderer/SW invalidations) without changing the severity of any single occurrence. Ship FE Layer 1+2 regardless; backend SW hardening is a nice-to-have follow-up.
The frontend doctor's hypothesis that there was a "Manifest V3
service-worker boot regression" — proposed by
toby-product-strategist artifact 388c1db4 — is not
supported by the evidence. The SW boot code is well-formed
(listeners synchronous, no await-before-addListener), and prod-api
hasn't changed in months. The post-2026-04-09 sensitivity is
purely the FE widening the gate; the underlying chrome.storage
drop is an evergreen Chromium MV3 phenomenon that the FE used to
tolerate accidentally.