# Backend finding — blank-extension-page (2026-05-11)

## Reproduced (from backend side)?

**no** — the backend is not implicated. The Go API has been
running the same commit since 2026-04-01, with a sub-0.002 %
error rate, roughly 1 M 2xx requests per day, and no correlated
error spike at or after 2026-04-09. The SW boot path has no
listener-after-await bug.
The hang lives in the renderer (toby.html tab) — almost certainly
Chrome's "extension context invalidated" state dropping
`chrome.storage.local.get` callbacks, which is a Chromium MV3
platform behaviour, not anything we ship on the Go side.

## Symptoms observed in logs / SW / API

- **Prod-api is stable.** `prod-api-00427-9p2` (revision
  `commit-sha=4b0107858e706c904e6cf2841fbcbf81a1e2f94f`) has been
  the active prod revision since 2026-04-01T20:08Z. The two before
  it (00425, 00426) ran the same SHA, so the **last actual code
  change to deployed prod-api is older than 2026-02-02** — well
  before the user-complaint window the frontend doctor narrowed to
  **post-2026-04-09**.
- **Error volume is a rounding error.** Last 24 h: **0** 5xx in prod
  metrics (Cloud Run aggregated `request_count`,
  `response_code_class=5xx`). Since 2026-05-04: 23 ERROR-severity
  entries in stderr, of which **17** are expected 401s on
  stale-session calls (`PUT /v2/users/activity`, `GET /v2/states`,
  `POST /v3/nexts/search`), **5** are `failed to search nextItems`
  errors, and **1** is a downstream `toby-ai-api` 500 on a
  cache-warm AI call — none are on the user-hydration path.
- **503 noise is normal cold-edge churn**, not a regression. 51 ×
  Cloud-Run "malformed response or connection error" 503s in the
  last 36 h are spread across `/v2/states`, `/v2/surveys`,
  `/v2/users`, `/v2/public/announcements` — all post-hydration
  endpoints. They cannot cause the *pre-hydration* blank-page hang.
- **No panics, fatals, or CRITICALs** in the prod-api log stream
  since 2026-04-01.
- **DB is healthy.** `users` (Toby Prod, read-only) shows 41,578
  last-active in the past 24 h, 59,654 in the past 7 d, 720 new
  signups in the past 7 d — i.e. **the API is successfully
  serving the hot user-activity path right now**.
- **SW boot path is well-formed.** Every `chrome.*.addListener`
  call in `apps/extension/app/background/*.ts` registers
  synchronously at module top level (or synchronously inside a
  non-awaiting `init()`). I could not find a listener-after-await
  MV3 boot-bug pattern.
- **Min-instances=4** on prod-api → cold-start hangs are not a
  factor for the request volumes in play.

## Root cause (best hypothesis, backend lens)

**Confidence: high** that the proximate fault is **not** on the
backend. The chrome.storage callback drop the frontend doctor
identified is a **Chromium MV3 renderer-side phenomenon** ("extension
context invalidated") that occurs in the **toby.html tab**, not in
the SW or our API:

1. Chrome auto-updates the extension while the user has a
   long-lived new-tab open (or the user reloads the extension
   themselves). Chrome marks the old tab's extension context
   invalidated; subsequent calls to `chrome.storage.local.get` from
   that tab return without invoking the callback. `getUser()` at
   `apps/extension/app/state/accessors/user.tsx:45-50` neither
   checks `chrome.runtime.lastError` nor bounds the wait, so the
   dropped callback is indistinguishable from "just slow" (see the
   sketch after this list).
2. The frontend's `d68726b29` commit (2026-04-09) widened the
   `AuthWrapper` gate at
   `apps/extension/app/containers/Toby.tsx:304` to depend on
   `isUserHydrated`, which is bound 1:1 to that one unbounded
   storage callback. **The same chrome.storage drop now reliably
   produces a blank page where it used to either work or only
   flicker.**
3. The Go API is structurally incapable of causing this drop:
   `getUser()` does **not** make a network call. It reads from
   chrome.storage. The very first network request to `prod-api`
   would happen only after `isUserHydrated` flips true. The hang
   precedes any HTTP request.
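
For illustration, the unbounded read in step 1 has roughly the
shape below (a paraphrase, not the actual `user.tsx` source): once
the callback is dropped, the promise never settles, `isUserHydrated`
never flips, and `AuthWrapper` renders nothing indefinitely.

```typescript
// Illustrative paraphrase of the unbounded pattern at user.tsx:45-50 (not the real source).
// No lastError check, no timeout, no reject path: a dropped callback means the returned
// promise never settles and hydration never completes.
function getUser(): Promise<unknown> {
  return new Promise((resolve) => {
    chrome.storage.local.get('user', (items) => {
      resolve(items?.user ?? null); // never reached if Chrome drops the callback
    });
  });
}
```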

What the backend doctor *can* refute, with evidence:
- Not an API regression — code is from 2026-02-02 or earlier
  (commit SHA 4b0107858 has been deployed on three consecutive
  revisions back to revision 00425, 2026-02-02; the deploys since
  are config-only redeploys of the same SHA).
- Not a 5xx storm — zero 500s or other 5xx-class responses in the
  last 24 h (`response_code_class="5xx"`, hourly time series, all
  zero).
- Not a CORS / cookie / CSP change — no commits to
  `apps/api/middlewares/`, `apps/api/routes/`, or any session
  handling code in 2026.
- Not a database collapse — 41k DAU, healthy hourly distribution,
  no row-count anomaly.
- Not a Cloud-Run cold-start race — min-instances=4 keeps the
  service permanently warm; the 503 "malformed response" blips
  are tiny (51 / 36 h) and don't touch the auth path.

## Evidence

### GCP logs — error volume since 2026-05-04

Filter (gcp-observability `list_log_entries`):
```
resource.type="cloud_run_revision"
  AND resource.labels.service_name="prod-api"
  AND severity>=ERROR
  AND timestamp>="2026-05-04T00:00:00Z"
```

Distinct error messages and counts (all 23 entries):
```
   8  "401 PUT /v2/users/activity"
   8  "401 GET /v2/states"
   5  "failed to search nextItems"
   1  "failed to send AI request"            # downstream toby-ai-api 500
   1  "401 POST /v3/nexts/search"
```
Mostly stale-session 401s plus a handful of
`failed to search nextItems` errors; no panics. The one 500 is
downstream `toby-ai-api` from
`services/ai/ai_service.go:171 UpdateUserCache` — not on the auth
or hydration path.

### GCP metrics — 5xx by day (`run.googleapis.com/request_count`)

```
2026-05-10 → 2026-05-11   500-class: 0     2xx: 1,108,508
2026-05-09 → 2026-05-10   500-class: 4     2xx:   560,996
2026-05-08 → 2026-05-09   500-class: 0     2xx:   718,076
2026-05-07 → 2026-05-08   500-class: 19    2xx: 1,186,238
2026-05-06 → 2026-05-07   500-class: 8     2xx: 1,265,714
```
Worst day = 19 / 1.18 M ≈ 0.0016 %. No spike pattern.

### Cloud-Run revisions (`gcloud run revisions list --service=prod-api`)

```
prod-api-00427-9p2  2026-04-01T20:08Z  commit 4b01078  ACTIVE
prod-api-00426-s6l  2026-04-01T17:58Z  commit 4b01078
prod-api-00425-x7f  2026-02-02T19:13Z  commit 4b01078
prod-api-00424-dpx  2026-01-28T20:16Z  commit 0b44adc
```
The deployed SHA hasn't *actually changed* since at least
2026-02-02. The 2026-04-01 deploys are config redeploys of the
same SHA. **Code on the wire today is older than the bug window.**

### Cloud-Run config (`gcloud run services describe prod-api`)

```
autoscaling.knative.dev/minScale: "4"
autoscaling.knative.dev/maxScale: "20"
containerConcurrency: 900
run.googleapis.com/cpu-throttling: "false"
run.googleapis.com/startup-cpu-boost: "true"
```
Always 4 warm instances; cold-start cannot explain a recurring
user-visible issue.

### DB query — Toby Prod (Read Only) `be55a66b-c905-4759-9ce1-a97785bb69e6`

```sql
SELECT
  COUNT(*) AS total_users,
  COUNT(*) FILTER (WHERE created_at  >= NOW() - INTERVAL '7 days') AS new_7d,
  COUNT(*) FILTER (WHERE created_at  >= NOW() - INTERVAL '1 day')  AS new_1d,
  COUNT(*) FILTER (WHERE last_active >= NOW() - INTERVAL '1 day')  AS active_1d,
  COUNT(*) FILTER (WHERE last_active >= NOW() - INTERVAL '7 days') AS active_7d
FROM users WHERE deleted_at IS NULL;
```
```
total_users  new_7d  new_1d  active_1d  active_7d
1,046,842    720     93      41,578     59,654
```
And hourly distribution (last 12 h):
```
2026-05-11 17:00Z   943
2026-05-11 16:00Z 2,427
2026-05-11 15:00Z 2,740
2026-05-11 14:00Z 3,015
2026-05-11 13:00Z 3,352   ← peak
2026-05-11 12:00Z 2,877
2026-05-11 11:00Z 2,072
2026-05-11 10:00Z 1,418
2026-05-11 09:00Z 1,443
2026-05-11 08:00Z 1,736
2026-05-11 07:00Z 2,018
2026-05-11 06:00Z 1,853
```
Healthy diurnal curve. No collapse, no flatline. The API is
processing this traffic right now.

### SW source — `apps/extension/entrypoints/background.ts`

```typescript
import { defineBackground } from 'wxt/utils/define-background';
import { persistQueryClientRestore } from '@tanstack/react-query-persist-client';
import { IDBpersister, queryClient } from '~/state/client';

// These register chrome listeners as side effects
import '~/background/contextMenus';
import '~/background/inject';
import '~/background/badge';
import '~/background/onInstall';
import '~/background/tobyLinks';
import '~/background/omnibox';

export default defineBackground(() => {
  persistQueryClientRestore({
    queryClient,
    persister: IDBpersister,
    maxAge: Infinity,
    hydrateOptions: { defaultOptions: { queries: { gcTime: Infinity } } },
  });
});
```
The side-effect imports happen *before* `defineBackground` is
called, so they register listeners synchronously at module
evaluation. No MV3 boot-bug.

### SW listeners (all top-level / no await before addListener)

- `apps/extension/app/background/onInstall.ts:9` — `chrome.runtime.onInstalled.addListener` (sync, top-level).
- `apps/extension/app/background/inject.ts:140,204` — `chrome.tabs.onUpdated.addListener`, `chrome.runtime.onMessage.addListener` (both sync, after only `const`s).
- `apps/extension/app/background/badge.ts:4-5` — `chrome.action.setBadge*` (sync, top-level; no listeners).
- `apps/extension/app/background/omnibox.ts:1` — `chrome.omnibox.onInputEntered.addListener` (sync, top-level).
- `apps/extension/app/background/tobyLinks.ts:5-81` — `chrome.webRequest.onBeforeRequest.addListener` × 2, `chrome.tabs.onUpdated.addListener` (sync, all inside an `if (isFireFox())` gate; Chrome users skip this entirely).
- `apps/extension/app/background/contextMenus.ts:217-261` — declared `async function init()` but **no `await` anywhere before** the three `.addListener` calls on lines 218, 219, 231. Listener attachment is therefore synchronous. ✓

The pattern check confirms what the prod metrics already imply:
even when SWs *do* die and restart, the listeners re-attach
correctly. The MV3 boot path isn't the wound.
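
For reference, the anti-pattern that check was hunting for looks
like the sketch below; the names (`brokenBoot`, the echo handler)
are illustrative, not code from the repo.

```typescript
// BROKEN shape: the listener is attached only after an await. A cold-started SW can be woken
// by an event before the listener exists, and the event is silently dropped.
async function brokenBoot(): Promise<void> {
  await chrome.storage.local.get('settings'); // any await before addListener is the bug
  chrome.runtime.onMessage.addListener(() => undefined); // too late on a cold start
}

// CORRECT shape (what the background/*.ts modules actually do): attach listeners synchronously
// at module evaluation, then do async work afterwards.
chrome.runtime.onMessage.addListener((message, _sender, sendResponse) => {
  sendResponse({ ok: true, echo: message });
});
```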

### SW shortcomings I *did* find (defence-in-depth, not the root cause)

- `apps/extension/app/background/contextMenus.ts:145-163` — fetches `${env.API_URL_V3}/cards` from the SW with **no `AbortController` / no timeout**. If the network stalls during a context-menu save and the SW idles past 30 s, the in-flight fetch keeps the SW alive until the TCP connection resets — not a crash, but it means SW lifetimes are jittery on bad networks.
- `apps/extension/entrypoints/background.ts:14-19` — `persistQueryClientRestore(...)` inside `defineBackground` is fire-and-forget with **no `.catch()`**. If IDB throws on cold open, it's silently swallowed; the queryClient stays empty and any SW callsite reading from it gets stale-empty data. This is the same IDB-availability surface the frontend doctor flagged for `useIsRestoring()` — except in the SW it's even less visible because the SW has no UI.
- `apps/extension/app/state/client.ts:24-29` — `IDBpersister.restoreClient` calls `get(idbValidKey)` from `idb-keyval` with no error path. If IDB is locked or quota-exceeded, the promise rejects deep inside `persistQueryClientRestore`. Same failure surface as the previous item.

None of those *cause* the blank page. They explain why the
"silent" failure mode is so silent.

### Recent commits — Go API since 2026-01-01 (everything)

```
b9bea18c  Post fallback Slack message when CWS review AI draft fails  (#10)
4666eefc  Merge pull request #4 (CWS review monitor)
cbc92a78  Make retention discount eligible for all cancellation reasons
ba247d9a  Add Chrome Web Store review monitor with AI-drafted responses
0727449d  Migrate landing page CI and remove dead workflows
5bd96126  Add Turborepo configuration
7effe8b6  Refine legacy user discount logic in subscription context  (#876)
81f1ab57  Refactor retention offer logic                              (#875)
a462d067  Enhance user email verification logic … token validation    (#874)
36fe2451  Refactor retention offer logic to utilize Stripe coupon     (#872)
52f51924  Enhance retention offer logic and legacy user support       (#871)
bd6f64a8  Add retention discount functionality                        (#869)
9fb43ddd  Fix build error from last commit                            (#870)
7c7a1cdf  Remove Legacy Discount from getCoupon                       (#868)
```
**None** of these touch `apps/api/middlewares`, `apps/api/routes`,
session/cookie handling, or anything the extension hits on boot.
`a462d067` is the only one in user-verification territory but it
predates the latest prod deploy (2026-04-01) and is on a Stripe
verify path, not the SW-relevant chrome.storage seed.

## Proposed fix (backend / SW side)

There is no backend-side fix that resolves the user-visible bug,
because the bug is not on the backend. The frontend doctor's
**Layer 1 (timeout the hydration promise)** and **Layer 2
(recovery UI)** are the correct, sufficient fixes.

That said, three **SW-side hardening** items are worth shipping
alongside the FE fix, because they remove ambiguity from future
incidents of this shape:

1. **Catch and log the persist-restore failure in the SW.**
   `apps/extension/entrypoints/background.ts:14`:

   ```diff
    export default defineBackground(() => {
      persistQueryClientRestore({
        queryClient,
        persister: IDBpersister,
        maxAge: Infinity,
        hydrateOptions: { defaultOptions: { queries: { gcTime: Infinity } } },
   -  });
   +  }).catch((err) => {
   +    console.error('[toby-sw] persistQueryClientRestore failed', err);
   +  });
    });
   ```

   Zero functional change; just makes IDB failures *visible* in
   `chrome://extensions → service worker → console`.

2. **Bound the SW `fetch(/v3/cards)` and `fetch(/lists)` paths
   with an AbortController.**
   `apps/extension/app/background/contextMenus.ts:145-191`. A
   10 s timeout + abort. Prevents the SW from being kept alive by
   a stuck TCP socket on bad networks.
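
   A minimal sketch of the shape, assuming the existing
   `env.API_URL_V3` plumbing in `contextMenus.ts`; the helper name
   and defaults are illustrative, not current code:

   ```typescript
   // Hypothetical helper (not current code): bound an SW-side fetch with an abortable
   // timeout so a stuck TCP socket cannot keep the request alive indefinitely.
   async function fetchWithTimeout(
     url: string,
     init: RequestInit = {},
     timeoutMs = 10_000,
   ): Promise<Response> {
     const controller = new AbortController();
     const timer = setTimeout(() => controller.abort(), timeoutMs); // abort frees a hung request
     try {
       return await fetch(url, { ...init, signal: controller.signal });
     } finally {
       clearTimeout(timer); // avoid a dangling timer when the response arrives in time
     }
   }

   // Usage (illustrative): fetchWithTimeout(`${env.API_URL_V3}/cards`, { method: 'POST', body });
   ```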

3. **Promisify the SW's `chrome.storage` reads with a
   `lastError` check at the source.** Build a single
   `chromeStorageGet<T>(keys, { timeoutMs })` helper that:
   - reads `chrome.runtime.lastError` on every callback,
   - rejects on a timeout,
   - rejects if `chrome.runtime.id` is undefined (the canonical
     signal that the extension context is invalidated).

   Replace every raw `chrome.storage.local.get(key, cb)` in
   `app/state/accessors/*.tsx` and `app/utils/chromeapi.ts:248`
   with that helper. This is the *real* defence — the FE's Layer
   1 only handles the `getUser` site; this helper handles every
   site uniformly. (It belongs in the FE codebase; from the
   backend lens, I'm flagging it because the chrome.storage drop
   is a class of bug, not a one-off.)
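
   A sketch of the helper's shape under those requirements (the
   name, defaults, and error messages are illustrative):

   ```typescript
   // Hypothetical helper, not existing code: one bounded, lastError-aware chrome.storage read.
   export function chromeStorageGet<T>(
     keys: string | string[],
     { timeoutMs = 5_000 }: { timeoutMs?: number } = {},
   ): Promise<T> {
     return new Promise<T>((resolve, reject) => {
       // chrome.runtime.id being undefined is the canonical "context invalidated" signal.
       if (!chrome.runtime?.id) {
         reject(new Error('extension context invalidated'));
         return;
       }
       const timer = setTimeout(
         () => reject(new Error(`chrome.storage.local.get timed out after ${timeoutMs} ms`)),
         timeoutMs,
       );
       chrome.storage.local.get(keys, (items) => {
         clearTimeout(timer);
         if (chrome.runtime.lastError) {
           reject(new Error(chrome.runtime.lastError.message));
         } else {
           resolve(items as unknown as T);
         }
       });
     });
   }
   ```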

I would NOT redeploy prod-api as part of this incident — the API
code is innocent and a redeploy is a needless blast-radius
increase.

## Verify plan

1. **Confirm prod-api stability over the next 24 h.** Re-run the
   5xx and ERROR-severity log queries above; expect the same
   near-zero counts. No action if so.

2. **After FE Layer 1 ships, watch for the new `NewTabHangShown`
   beacon** the frontend doctor recommended (Layer 3). If it
   fires at non-trivial volume *without* a corresponding 5xx
   spike in prod-api, that confirms the chrome.storage drop is
   platform-side and we ship Layer 1+2 as-is.

3. **If the beacon volume correlates with a Cloud-Run revision
   bounce** (use `gcloud run revisions list --service=prod-api`
   to spot any 5xx jump on a new revision), re-open the backend
   investigation. With the current cadence (one prod-api code
   change in 4 months) this is unlikely.

4. **Add a synthetic SW health probe.** A Cloud Scheduler job
   that hits `https://api2.gettoby.com/v2/states` with a
   throwaway token every 5 min and alerts if 5xx > 2 % over a
   15 min window. Right now we have no signal between "user
   complains in CWS review" and "Sentry/Amplitude burn".
   Out-of-scope for this incident; file as follow-up.

5. **Optional repro for SW context invalidation.** In a dev
   profile: open `chrome-extension://<id>/toby.html` →
   chrome://extensions → toggle the extension off and back on.
   The open tab should now have `chrome.runtime.id === undefined`
   and any `chrome.storage.local.get` call from it returns without
   invoking its callback. This is the canonical scenario; if FE
   Layer 1 rescues the page after 5 s in this state, the fix is
   verified.
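
   A console probe for that state (paste into the stale tab's
   DevTools console; the `'user'` key is illustrative):

   ```typescript
   // In an invalidated context chrome.runtime.id is undefined and the get() callback is never
   // invoked (some Chrome versions throw "Extension context invalidated" instead).
   console.log('runtime id:', chrome.runtime?.id);
   let fired = false;
   try {
     chrome.storage.local.get('user', () => {
       fired = true;
       console.log('storage callback fired');
     });
   } catch (err) {
     console.log('storage.get threw:', err);
   }
   setTimeout(() => console.log('callback fired within 5 s?', fired), 5_000);
   ```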

## Confirm or refute frontend hypothesis

**Refine, then confirm.** The frontend doctor wrote:

> If the MV3 service worker is dead or the extension context is
> invalidated … `chrome.storage.local.get` can return without
> invoking its callback.

The "MV3 service worker is dead" half is **unlikely to be the
cause** in isolation. `chrome.storage.local.get` reads are handled
by the Chrome browser process, not by the extension SW — a dead
SW alone does not drop callbacks; Chrome will quietly fulfill the
storage read either way. What **does** drop callbacks is the
*renderer* (the toby.html tab) being in the
**"extension context invalidated"** state. That state is most
commonly entered when:

  - Chrome auto-updates the extension while a chrome_url_override
    new-tab is open (Toby owns the new-tab page → this happens to
    every Toby user, every time we ship an update);
  - the user manually disables/re-enables or reloads the
    extension;
  - the SW crashes *during* a critical phase that invalidates the
    handshake (rare, but possible — and the SW IS more crash-prone
    than it should be because of the unhandled
    `persistQueryClientRestore` rejection and the un-aborted
    fetches I flagged above).

So the **chrome.storage callback drop is real**, the
**FE-side fix is correct and sufficient**, and the SW-hardening
items in §"Proposed fix" reduce the *frequency* of the drop
(by reducing renderer/SW invalidations) without changing the
*severity* of any single occurrence. Ship FE Layer 1+2
regardless; backend SW hardening is a nice-to-have follow-up.

The frontend doctor's hypothesis that there was a "Manifest V3
service-worker boot regression" — proposed by
`toby-product-strategist` artifact `388c1db4` — is **not
supported by the evidence**. The SW boot code is well-formed
(listeners synchronous, no await-before-addListener), and prod-api
hasn't changed in months. The post-2026-04-09 sensitivity is
**purely** the FE widening the gate; the underlying chrome.storage
drop is an evergreen Chromium MV3 phenomenon that the FE used to
tolerate accidentally.
