# Validation — blank-extension-page (2026-05-11)

## Verdict

`validated`

Confidence: **high**

The root-cause story is supported by the live code, the proximate commit
exists with the claimed shape, and the backend's "innocent" verdict
matches what I see in Cloud Run + the production DB right now. The
proposed Layer 1 + Layer 2 + Layer 3 fix is correct, race-safe,
preserves the original intent of `d68726b29`, and exposes nothing
sensitive. Ship it. Two open follow-ups noted below — neither blocks
closing this incident.

## Spot-checks performed

- **Code, `apps/extension/app/containers/Toby.tsx:304`** — read directly.
  Exact text: `if (isInitializing || !isDraftReady || !isUserHydrated) return null;`.
  Matches the synthesis claim 1:1. AuthWrapper at 277-314 is exactly as
  the FE doctor described. Both `<App>` and `<Onboarding2>` are inside
  this wrapper, confirming a null return kills the entire visible UI.
- **Code, `apps/extension/app/state/accessors/user.tsx:45-50`** — read directly.
  `getUser` body is exactly:
  ```tsx
  export const getUser = () =>
    new Promise<LoginResponse | null>((resolve) => {
      chrome.storage.local.get('user', ({ user }) => {
        resolve(user ?? null);
      });
    });
  ```
  Confirmed: **no** `chrome.runtime.lastError` check, **no** `.catch`,
  **no** timeout. `UserProvider` useEffect at 71-76 is also exactly as
  described — `getUser().then(...)` with no fallback. Unbounded surface
  is real.
- **Code, `apps/extension/app/utils/chromeapi.ts:248-259`** — confirms
  the *same* missing pattern in `getChromeStorage`, which
  `onboarding2DraftStorage.get()` is built on. The class of bug spans
  beyond `getUser`.
- **Code, `apps/extension/entrypoints/toby/index.html:62-69`** — preload
  skeleton is the static markup the user sees when `#root` has no
  children. Matches the FE doctor's repro shape 1:1.
- **Git, commit `d68726b29`** — exists, dated `2026-04-09T19:07:05 +03:00`,
  author Jad Haidar, message *"fix: gate AuthWrapper on user hydration
  to prevent duplicate onboarding events"*, touches only
  `apps/extension/app/containers/Toby.tsx` (+5 / -1). The diff adds the
  `!isUserHydrated` clause to the gate — exactly as the synthesis claims.
- **Cloud Run, `gcloud run revisions list --service=prod-api
  --region=us-east4`** — confirms the four most recent revisions
  reported by the backend doctor verbatim: 00427-9p2 (2026-04-01 20:08Z,
  ACTIVE) → 00426-s6l (2026-04-01 17:58Z) → 00425-x7f (2026-02-02 19:13Z)
  all on commit `4b0107858`; 00424-dpx (2026-01-28 20:16Z) on the
  preceding SHA `0b44adc`. **The deployed SHA has not changed since
  2026-02-02** — well outside the post-2026-04-09 user-complaint window.
  Backend cannot be the regression.
- **DB, Toby Prod RO** — re-ran the 1-day active-user count:
  `41,630` users active in last 24 h vs. `41,578` reported by the BE
  doctor a few hours earlier. Healthy hot path; matches diurnal
  expectation.
- **Logging spot-check** — `list_log_entries` returned a permission-denied
  for the validator's IAM scope (the BE doctor's account has broader log
  scope). I could not independently re-run the 5xx / ERROR-severity
  count. The Cloud Run revision evidence + healthy DB queries are
  strong-enough surrogates that I do not consider this a blocker for
  the verdict — but flagging it for transparency. **Did not personally
  re-verify the "0 5xx in last 24h" claim**; relying on BE doctor's
  citation plus the Cloud Run + DB cross-evidence.
- **Playwright synthetic repro** — did not personally re-run it. The FE
  doctor's `repro-blank-page.png` and the static HTML at
  `index.html:62-69` make the visual repro a one-step trivial mapping
  (DOM with `.preloadedBg` skeleton + empty `#root`). The accessibility
  snapshot being empty is mechanically a function of having no React
  children — I accept this without re-running.

## Correctness audit

- **Does `setIsUserHydrated(true)` on timeout actually unblock the
  gate?** Yes. `!isUserHydrated` becomes `false`; the gate collapses
  to `isInitializing || !isDraftReady`. If those are healthy, the
  AuthWrapper proceeds to render `<App>` or `<Onboarding2>`. The
  unblock works.
- **Race between timeout and legitimate `.then` resolution?** Safe.
  When `getUser` resolves before 5s, `.then` runs synchronously to
  `setIsUserHydrated(true)` and `.finally(() => clearTimeout(timeout))`
  runs in the same microtask flush. The 5s `setTimeout` (macrotask)
  is cancelled before it can fire. No double-fire on the happy path.
- **Race between timeout firing and a *slow* `.then` arriving after
  5s?** Layer 1 sets `isUserHydrated=true` at 5s; if the storage
  callback then fires at e.g. 5.5s, `.then` runs, `cancelled` is still
  false (mount lives), and `setUser(user); setIsUserHydrated(true)`
  runs — so a legitimate but slow user record IS still applied. The
  in-memory user is preserved (not clobbered). **The non-destructive
  property is the key thing that makes this fix safe to ship.**
- **`cancelled` flag semantics.** The flag flips only on unmount.
  All three sites (timeout, `.then`, `.catch`) gate on it before
  calling state setters. Correctly prevents state updates after the
  component unmounts. React 18 StrictMode double-invoke is handled
  cleanly because each effect run has its own `cancelled` and its own
  `timeout` in closure.
- **Completeness gap (isInitializing).** Layer 1 patches `getUser`
  and (per `Apply the same shape to useOnboarding2Draft.ts:12-30`)
  the draft path. **It does NOT patch the third gate boolean
  `isInitializing`**, which depends on `useIsRestoring()` from the
  react-query persistor (IDB-backed; same hang class — flagged by
  both the FE doctor at finding lines 99-101 and the BE doctor at
  finding lines 230-235). Layer 2's 8-second `StuckRecoveryScreen` is
  the backstop for this case, so the user is *never* stuck on a
  blank page — but they may have to tap a recovery button instead of
  the page self-healing at 5s. Acceptable for the close, but should
  be a follow-up to apply a `useIsRestoring`-flavoured Layer 1 patch
  (e.g. a local 5s timer in `useHandleRedirectFromQueryParams` that
  forces `setIsLoading(false)` even if `isRestoringCachedData` is
  still true). Listed under Recommendation, not under Objections.

## Quality audit

- **Preserves `d68726b29` intent on the happy path.** When `getUser`
  resolves within 5s with a returning user, the gate behaves exactly
  as it does today: no flash of `<Onboarding2>`. The fix only changes
  behaviour in the failure case (chrome.storage callback dropped) and
  in the very-slow case (>5s) — and even the >5s case applies the user
  once it eventually arrives, so the "duplicate onboarding events"
  d68726b29 was protecting against can only occur in the precise
  scenario where storage is slower than 5s AND the resolved user
  exists. That's the genuine pathological state — flashing Onboarding2
  briefly is the lesser evil over the blank-page hang.
- **Code style.** Idiomatic React-18-era effect cleanup pattern.
  `let cancelled` + `clearTimeout` + chained `.then/.catch/.finally`
  is standard; the cleanup function correctly handles both timers
  and in-flight promise resolution.
- **Logging.** `console.warn` on the fallback and `console.error` on
  the catch are appropriately verbose — they surface the rare-path
  behaviour without spamming the happy path.
- **Telemetry (Layer 3).** Beacon fires at the Layer-2 8s site, not
  at the Layer-1 5s site. That means the Layer-1 fallback (which is
  the *common* recovery path post-fix) is invisible in Amplitude.
  Consider firing a second, lower-stakes event at the Layer-1 timeout
  site (e.g. `NewTabHydrationTimeout`). Polish, not a blocker.
- **Pre-approved copy.** *"Your tabs are safe. Tap to recover."* is
  consistent with `toby/strategy/playbook.md` O1 KR1 and `00-state-of-the-project.md:50`.
  Operator does not need a copy review.

## Security audit

- **Does fail-open to "null user" expose sensitive data?** No. The
  fallback only sets the React state `isUserHydrated=true`; it does
  NOT call `setUser(null)` (which would clobber chrome.storage). The
  in-memory `user` state remains `null` until either the slow
  storage callback eventually resolves OR the user signs in again
  through Onboarding2. Onboarding2 is fully public, pre-auth UI —
  there is nothing token-gated that renders before authentication.
- **Token / session integrity.** The fix does not touch `setUser`,
  `getToken`, the OAuth callback handler, or any token-bearing
  request path. Auth posture is unchanged.
- **State that should NOT re-initialise.** None. The fallback is
  purely the boolean gate flip; user, draft, queryClient cache, and
  chrome.storage values are untouched.
- **CSP / `chrome.runtime.id` / context invalidated.** The fix
  *responds to* the context-invalidated state but does not attempt
  to recover from it (no `chrome.runtime.id` check, no force-reload).
  That's appropriate at this layer — Layer 2's "tap to recover"
  reload is the human escape. Whether to detect
  `chrome.runtime.id === undefined` and auto-reload is a product
  question for follow-up.
- **No new attack surface.** No new code paths handling untrusted
  input. No CSP changes. No JWT / cookie handling. No secrets.
- **Security-review skill gate.** The fix does NOT touch auth,
  sessions, tokens, secrets, JWT, or CSP, so the mandatory deep
  security-review skill does not need to spin up. The lighter
  review above is sufficient.

## Objections (if rejected or conditional)

(None — verdict is `validated`.)

## Recommendation

Coordinator should publish the incident doc with `status: closed`
and ship the FE Layer 1 + Layer 2 + Layer 3 fix as the resolution.
The synthesis is accurate, the fix is correct, race-safe, and
non-destructive, and the backend's "innocent" finding holds up under
independent re-check.

**Three follow-ups to file as separate work — do NOT block closing
this incident on them:**

1. **Apply the same shape to `isInitializing`** (specifically the
   `useIsRestoring()` IDB-backed path inside
   `useHandleRedirectFromQueryParams` at `Toby.tsx:168-275`). Today
   Layer 2 catches this case after 8s with a recovery screen; ideal
   is a 5s Layer-1-style local timer that forces
   `setIsLoading(false)` and lets the page self-heal without a tap.
2. **The three SW-hardening items the BE doctor flagged**
   (background.ts:14 `.catch`; contextMenus.ts:145-191
   `AbortController`; unified `chromeStorageGet<T>(keys, { timeoutMs })`
   helper). These reduce the *frequency* of the chrome.storage
   callback drop. Not on the critical path for this incident.
3. **A Layer-1 telemetry beacon** at the 5s `getUser` fallback site
   (separate from Layer 3's 8s `NewTabHangShown` at the recovery
   screen). Without it, the common recovery path is invisible in
   Amplitude and we can only see the worst-case 8s tail. Tiny diff.

**Two operator decisions worth surfacing on the published doc:**

- Whether to gate the `NewTabHangShown` beacon (and the
  Layer-1 hydration-timeout beacon, if added) behind a feature flag.
  Synthesis defaults to "on"; I concur.
- Whether to redeploy prod-api as part of this incident. Per the BE
  doctor and my own re-check: **no**. The API code is innocent and a
  redeploy is needless blast-radius.
