
Backend finding — blank-extension-page (2026-05-11)

Hand-authored · 12 min read · 16 sections · Last edited May 13 (initial import)

Reproduced (from backend side)?

No — the backend is not implicated. The Go API has been deployed on the same commit since 2026-04-01 with a sub-0.002 % error rate, ~1 M 2xx requests/day, and no correlated spike at or after 2026-04-09. The SW boot path has no listener-after-await bug. The hang lives in the renderer (the toby.html tab) — almost certainly Chrome's "extension context invalidated" state dropping chrome.storage.local.get callbacks, which is a Chromium MV3 platform behaviour, not anything we ship on the Go side.

Symptoms observed in logs / SW / API

  • Prod-api is stable. prod-api-00427-9p2 (revision commit-sha=4b0107858e706c904e6cf2841fbcbf81a1e2f94f) has been the active prod revision since 2026-04-01T20:08Z. The two before it (00425, 00426) ran the same SHA, so the last actual code change to deployed prod-api is older than 2026-02-02 — well before the user-complaint window the frontend doctor narrowed to post-2026-04-09.
  • Error volume is rounding error. Last 24 h: 0 5xx in prod metrics (Cloud Run aggregated request_count, response_code_class=5xx); 23 ERROR-severity entries in stderr, of which 22 are expected 401s on stale-session PUT /v2/users/activity / GET /v2/states and 1 is a downstream toby-ai-api 500 on a cache-warm UpdateUserAICache call — none are on the user-hydration path.
  • 503 noise is normal cold-edge churn, not a regression. 51 × Cloud-Run "malformed response or connection error" 503s in the last 36 h are spread across /v2/states, /v2/surveys, /v2/users, /v2/public/announcements — all post-hydration endpoints. They cannot cause the pre-hydration blank-page hang.
  • No panics, fatals, or CRITICALs in the prod-api log stream since 2026-04-01.
  • DB is healthy. users (Toby Prod, read-only) shows 41,578 last-active in the past 24 h, 59,654 in the past 7 d, 720 new signups in the past 7 d — i.e. the API is successfully serving the hot user-activity path right now.
  • SW boot path is well-formed. Every chrome.*.addListener call in apps/extension/app/background/*.ts registers synchronously at module top level (or synchronously inside a non-awaiting init()). I could not find a listener-after-await MV3 boot-bug pattern.
  • Min-instances=4 on prod-api → cold-start hangs are not a factor for the request volumes in play.

Root cause (best hypothesis, backend lens)

Confidence: high that the proximate fault is not on the backend. The chrome.storage callback drop the frontend doctor identified is a Chromium MV3 renderer-side phenomenon ("extension context invalidated") that occurs in the toby.html tab, not in the SW or our API:

  1. Chrome auto-updates the extension while the user has a long-lived new-tab open (or the user reloads the extension themselves). Chrome marks the old tab's extension context invalidated; subsequent calls to chrome.storage.local.get from that tab return without invoking the callback (lastError is set, but getUser() at apps/extension/app/state/accessors/user.tsx:45-50 never reads chrome.runtime.lastError, so the dropped callback is indistinguishable from "just slow").
  2. The frontend's d68726b29 commit (2026-04-09) widened the AuthWrapper gate at apps/extension/app/containers/Toby.tsx:304 to depend on isUserHydrated, which is bound 1:1 to that one unbounded storage callback. The same chrome.storage drop now reliably produces a blank page where it used to either work or only flicker.
  3. The Go API is structurally incapable of causing this drop: getUser() does not make a network call. It reads from chrome.storage. The very first network request to prod-api would happen only after isUserHydrated flips true. The hang precedes any HTTP request.
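The shape of the frontend's Layer 1 mitigation (bounding the hydration read with a timeout) can be sketched generically. This is an illustrative sketch, not code from the Toby repo; `withTimeout` is a hypothetical name:

```typescript
// Illustrative sketch: race any promise (e.g. a promisified storage read)
// against a timer, so a dropped chrome.storage callback surfaces as a
// catchable error instead of an indefinite blank-page hang.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`hydration timed out after ${ms} ms`)),
      ms,
    );
    p.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```

A caller would wrap the getUser() promise with this and flip the UI into a recovery state on rejection instead of waiting forever.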

What the backend doctor can refute, with evidence:

  • Not an API regression — code is from 2026-02-02 or earlier (commit SHA 4b0107858 has been deployed on three consecutive revisions back to revision 00425, 2026-02-02; the deploys since are config-only redeploys of the same SHA).
  • Not a 5xx storm — 0 of either 500 or 5xx-class in last 24 h (response_code_class="5xx", hourly time-series, all zero).
  • Not a CORS / cookie / CSP change — no commits to apps/api/middlewares/, apps/api/routes/, or any session handling code in 2026.
  • Not a database collapse — 41k DAU, healthy hourly distribution, no row-count anomaly.
  • Not a Cloud-Run cold-start race — min-instances=4 keeps the service permanently warm; the 503 "malformed response" blips are tiny (51 / 36 h) and don't touch the auth path.

Evidence

GCP logs — error volume since 2026-05-04

Filter (gcp-observability list_log_entries):

resource.type="cloud_run_revision"
  AND resource.labels.service_name="prod-api"
  AND severity>=ERROR
  AND timestamp>="2026-05-04T00:00:00Z"

Distinct error messages with counts (all 23 entries):

   8  "401 PUT /v2/users/activity"
   8  "401 GET /v2/states"
   5  "failed to search nextItems"
   1  "failed to send AI request"            # downstream toby-ai-api 500
   1  "401 POST /v3/nexts/search"

Everything except the single 500 is a 401-driven stale-session error. No panics. The one 500 is downstream toby-ai-api from services/ai/ai_service.go:171 UpdateUserCache — not on the auth or hydration path.

GCP metrics — 5xx by day (run.googleapis.com/request_count)

2026-05-10 → 2026-05-11   500-class: 0     2xx: 1,108,508
2026-05-09 → 2026-05-10   500-class: 4     2xx:   560,996
2026-05-08 → 2026-05-09   500-class: 0     2xx:   718,076
2026-05-07 → 2026-05-08   500-class: 19    2xx: 1,186,238
2026-05-06 → 2026-05-07   500-class: 8     2xx: 1,265,714

Worst day = 19 / 1.18 M ≈ 0.0016 %. No spike pattern.

Cloud-Run revisions (gcloud run revisions list --service=prod-api)

prod-api-00427-9p2  2026-04-01T20:08Z  commit 4b01078  ACTIVE
prod-api-00426-s6l  2026-04-01T17:58Z  commit 4b01078
prod-api-00425-x7f  2026-02-02T19:13Z  commit 4b01078
prod-api-00424-dpx  2026-01-28T20:16Z  commit 0b44adc

The deployed SHA hasn't actually changed since at least 2026-02-02. The 2026-04-01 deploys are config redeploys of the same SHA. Code on the wire today is older than the bug window.

Cloud-Run config (gcloud run services describe prod-api)

autoscaling.knative.dev/minScale: "4"
autoscaling.knative.dev/maxScale: "20"
containerConcurrency: 900
run.googleapis.com/cpu-throttling: "false"
run.googleapis.com/startup-cpu-boost: "true"

Always 4 warm instances; cold-start cannot explain a recurring user-visible issue.

DB query — Toby Prod (Read Only) be55a66b-c905-4759-9ce1-a97785bb69e6

SELECT
  COUNT(*) AS total_users,
  COUNT(*) FILTER (WHERE created_at  >= NOW() - INTERVAL '7 days') AS new_7d,
  COUNT(*) FILTER (WHERE created_at  >= NOW() - INTERVAL '1 day')  AS new_1d,
  COUNT(*) FILTER (WHERE last_active >= NOW() - INTERVAL '1 day')  AS active_1d,
  COUNT(*) FILTER (WHERE last_active >= NOW() - INTERVAL '7 days') AS active_7d
FROM users WHERE deleted_at IS NULL;
total_users  new_7d  new_1d  active_1d  active_7d
1,046,842    720     93      41,578     59,654

And hourly distribution (last 12 h):

2026-05-11 17:00Z   943
2026-05-11 16:00Z 2,427
2026-05-11 15:00Z 2,740
2026-05-11 14:00Z 3,015
2026-05-11 13:00Z 3,352   ← peak
2026-05-11 12:00Z 2,877
2026-05-11 11:00Z 2,072
2026-05-11 10:00Z 1,418
2026-05-11 09:00Z 1,443
2026-05-11 08:00Z 1,736
2026-05-11 07:00Z 2,018
2026-05-11 06:00Z 1,853

Healthy diurnal curve. No collapse, no flatline. The API is processing this traffic right now.

SW source — apps/extension/entrypoints/background.ts

import { defineBackground } from 'wxt/utils/define-background';
import { persistQueryClientRestore } from '@tanstack/react-query-persist-client';
import { IDBpersister, queryClient } from '~/state/client';

// These register chrome listeners as side effects
import '~/background/contextMenus';
import '~/background/inject';
import '~/background/badge';
import '~/background/onInstall';
import '~/background/tobyLinks';
import '~/background/omnibox';

export default defineBackground(() => {
  persistQueryClientRestore({
    queryClient,
    persister: IDBpersister,
    maxAge: Infinity,
    hydrateOptions: { defaultOptions: { queries: { gcTime: Infinity } } },
  });
});

The side-effect imports happen before defineBackground is called, so they register listeners synchronously at module evaluation. No MV3 boot-bug.

SW listeners (all top-level / no await before addListener)

  • apps/extension/app/background/onInstall.ts:9 — chrome.runtime.onInstalled.addListener (sync, top-level).
  • apps/extension/app/background/inject.ts:140,204 — chrome.tabs.onUpdated.addListener, chrome.runtime.onMessage.addListener (both sync, after only consts).
  • apps/extension/app/background/badge.ts:4-5 — chrome.action.setBadge* (sync, top-level; no listeners).
  • apps/extension/app/background/omnibox.ts:1 — chrome.omnibox.onInputEntered.addListener (sync, top-level).
  • apps/extension/app/background/tobyLinks.ts:5-81 — chrome.webRequest.onBeforeRequest.addListener × 2, chrome.tabs.onUpdated.addListener (sync, all inside an if (isFireFox()) gate; Chrome users skip this entirely).
  • apps/extension/app/background/contextMenus.ts:217-261 — declared async function init() but no await anywhere before the three .addListener calls on lines 218, 219, 231. Listener attachment is therefore synchronous. ✓

The pattern check confirms what the prod metrics already imply: even when SWs do die and restart, the listeners re-attach correctly. The MV3 boot path isn't the wound.
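For reference, the listener-after-await antipattern this check was hunting for, and did not find, looks like the following non-runnable sketch (handleMessage and handleAsync are hypothetical names):

```
// BAD: when Chrome wakes a stopped SW to dispatch an event, it evaluates the
// module and dispatches during the first synchronous turn; a listener added
// after an await is registered too late and the event is lost.
const settings = await chrome.storage.local.get('settings');
chrome.runtime.onMessage.addListener(handleMessage);

// GOOD: register synchronously at top level; defer async work into the handler.
chrome.runtime.onMessage.addListener((msg, _sender, sendResponse) => {
  handleAsync(msg).then(sendResponse);
  return true; // keep the message channel open for the async response
});
```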

SW shortcomings I did find (defence-in-depth, not the root cause)

  • apps/extension/app/background/contextMenus.ts:145-163 — fetches ${env.API_URL_V3}/cards from the SW with no AbortController and no timeout. If the network stalls during a context-menu save and the SW idles past 30 s, the in-flight fetch keeps the SW alive until TCP resets — not a crash, but it means SW lifetimes are jittery on bad networks.
  • apps/extension/entrypoints/background.ts:14-19 — persistQueryClientRestore(...) inside defineBackground is fire-and-forget with no .catch(). If IDB throws on cold open, the error is silently swallowed; the queryClient stays empty and any SW callsite reading from it gets stale-empty data. This is the same IDB-availability surface the frontend doctor flagged for useIsRestoring() — except in the SW it's even less visible because the SW has no UI.
  • apps/extension/app/state/client.ts:24-29 — IDBpersister.restoreClient calls get(idbValidKey) from idb-keyval with no error path. If IDB is locked or quota-exceeded, the promise rejects deep inside persistQueryClientRestore. Same surface as above.

None of those cause the blank page. They explain why the "silent" failure mode is so silent.

Recent commits — Go API since 2026-01-01 (everything)

b9bea18c  Post fallback Slack message when CWS review AI draft fails  (#10)
4666eefc  Merge pull request #4 (CWS review monitor)
cbc92a78  Make retention discount eligible for all cancellation reasons
ba247d9a  Add Chrome Web Store review monitor with AI-drafted responses
0727449d  Migrate landing page CI and remove dead workflows
5bd96126  Add Turborepo configuration
7effe8b6  Refine legacy user discount logic in subscription context  (#876)
81f1ab57  Refactor retention offer logic                              (#875)
a462d067  Enhance user email verification logic … token validation    (#874)
36fe2451  Refactor retention offer logic to utilize Stripe coupon     (#872)
52f51924  Enhance retention offer logic and legacy user support       (#871)
bd6f64a8  Add retention discount functionality                        (#869)
9fb43ddd  Fix build error from last commit                            (#870)
7c7a1cdf  Remove Legacy Discount from getCoupon                       (#868)

None of these touch apps/api/middlewares, apps/api/routes, session/cookie handling, or anything the extension hits on boot. a462d067 is the only one in user-verification territory but it predates the latest prod deploy (2026-04-01) and is on a Stripe verify path, not the SW-relevant chrome.storage seed.

Proposed fix (backend / SW side)

There is no backend-side fix that resolves the user-visible bug, because the bug is not on the backend. The frontend doctor's Layer 1 (timeout the hydration promise) and Layer 2 (recovery UI) are the correct, sufficient fixes.

That said, three SW-side hardening items are worth shipping alongside the FE fix, because they remove ambiguity from future incidents of this shape:

  1. Catch and log the persist-restore failure in the SW. apps/extension/entrypoints/background.ts:14:

     export default defineBackground(() => {
       persistQueryClientRestore({
         queryClient,
         persister: IDBpersister,
         maxAge: Infinity,
         hydrateOptions: { defaultOptions: { queries: { gcTime: Infinity } } },
    -  });
    +  }).catch((err) => {
    +    console.error('[toby-sw] persistQueryClientRestore failed', err);
    +  });
     });

    Zero functional change; just makes IDB failures visible in chrome://extensions → service worker → console.

  2. Bound the SW fetch(/v3/cards) and fetch(/lists) paths with an AbortController. apps/extension/app/background/contextMenus.ts:145-191. A 10 s timeout + abort. Prevents the SW from being kept alive by a stuck TCP socket on bad networks.

  3. Promisify the SW's chrome.storage reads with a lastError check at the source. Build a single chromeStorageGet<T>(keys, { timeoutMs }) helper that:

    • reads chrome.runtime.lastError on every callback,
    • rejects on a timeout,
    • rejects if chrome.runtime.id is undefined (the canonical signal that the extension context is invalidated).

    Replace every raw chrome.storage.local.get(key, cb) in app/state/accessors/*.tsx and app/utils/chromeapi.ts:248 with that helper. This is the real defence — the FE's Layer 1 only handles the getUser site; this helper handles every site uniformly. (It belongs in the FE codebase; from the backend lens, I'm flagging it because the chrome.storage drop is a class of bug, not a one-off.)
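A sketch of that helper, under the stated assumptions. The raw-get, runtime-id, and lastError accessors are injected so the logic is testable outside Chrome; in the extension they would be chrome.storage.local.get.bind(chrome.storage.local), () => chrome.runtime.id, and () => chrome.runtime.lastError:

```typescript
// The shape of chrome.storage.local.get's callback form.
type RawGet = (
  keys: string | string[],
  cb: (items: Record<string, unknown>) => void,
) => void;

function chromeStorageGet<T>(
  keys: string | string[],
  opts: { timeoutMs: number },
  rawGet: RawGet,
  runtimeId: () => string | undefined,
  lastError: () => { message?: string } | undefined,
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    // Canonical signal that the extension context is invalidated.
    if (runtimeId() === undefined) {
      reject(new Error('extension context invalidated'));
      return;
    }
    // Bound the read: a dropped callback becomes a rejection, not a hang.
    const timer = setTimeout(
      () => reject(new Error(`storage.get timed out after ${opts.timeoutMs} ms`)),
      opts.timeoutMs,
    );
    rawGet(keys, (items) => {
      clearTimeout(timer);
      const err = lastError();
      if (err) {
        reject(new Error(err.message ?? 'chrome.runtime.lastError set'));
      } else {
        resolve(items as T);
      }
    });
  });
}
```

Call sites then await the promise and treat rejection like any other hydration failure, which is exactly the uniformity the one-off FE fix lacks.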

I would NOT redeploy prod-api as part of this incident — the API code is innocent and a redeploy is a needless blast-radius increase.

Verify plan

  1. Confirm prod-api stability over the next 24 h. Re-run the 5xx and ERROR-severity log queries above; expect the same near-zero counts. No action if so.

  2. After FE Layer 1 ships, watch for the new NewTabHangShown beacon the frontend doctor recommended (Layer 3). If it fires at non-trivial volume without a corresponding 5xx spike in prod-api, that confirms the chrome.storage drop is platform-side and we ship Layer 1+2 as-is.

  3. If the beacon volume correlates with a Cloud-Run revision bounce (use gcloud run revisions list --service=prod-api to spot any 5xx jump on a new revision), re-open the backend investigation. With the current cadence (one prod-api code change in 4 months) this is unlikely.

  4. Add a synthetic SW health probe. A Cloud Scheduler job that hits https://api2.gettoby.com/v2/states with a throwaway token every 5 min and alerts if 5xx > 2 % over a 15 min window. Right now we have no signal between "user complains in CWS review" and "Sentry/Amplitude burn". Out-of-scope for this incident; file as follow-up.

  5. Optional repro for SW context invalidation. In a dev profile: open chrome-extension://<id>/toby.html → chrome://extensions → toggle the extension off and back on. The open tab should now have chrome.runtime.id === undefined, and any chrome.storage.local.get from it should return without invoking its callback. This is the canonical scenario; if FE Layer 1 rescues the page after 5 s in this state, the fix is verified.

Confirm or refute frontend hypothesis

Refine, then confirm. The frontend doctor wrote:

If the MV3 service worker is dead or the extension context is invalidated … chrome.storage.local.get can return without invoking its callback.

The "MV3 service worker is dead" half is unlikely to be the cause in isolation. chrome.storage.local.get reads are handled by the Chrome browser process, not by the extension SW — a dead SW alone does not drop callbacks; Chrome will quietly fulfill the storage read either way. What does drop callbacks is the renderer (the toby.html tab) being in the "extension context invalidated" state. That state is most commonly entered when:

  • Chrome auto-updates the extension while a chrome_url_override new-tab is open (Toby owns the new-tab page → this happens to every Toby user, every time we ship an update);
  • the user manually disables/re-enables or reloads the extension;
  • the SW crashes during a critical phase that invalidates the handshake (rare, but possible — and the SW IS more crash-prone than it should be because of the unhandled persistQueryClientRestore rejection and the un-aborted fetches I flagged above).

So the chrome.storage callback drop is real, the FE-side fix is correct and sufficient, and the SW-hardening items in §"Proposed fix" reduce the frequency of the drop (by reducing renderer/SW invalidations) without changing the severity of any single occurrence. Ship FE Layer 1+2 regardless; backend SW hardening is a nice-to-have follow-up.

The frontend doctor's hypothesis that there was a "Manifest V3 service-worker boot regression" — proposed by toby-product-strategist artifact 388c1db4 — is not supported by the evidence. The SW boot code is well-formed (listeners synchronous, no await-before-addListener), and prod-api hasn't changed in months. The post-2026-04-09 sensitivity is purely the FE widening the gate; the underlying chrome.storage drop is an evergreen Chromium MV3 phenomenon that the FE used to tolerate accidentally.