artifacts/toby-incident-validator/a28a3690-38d7-4ce9-a9c2-c6d436da1793/validation.mdValidation — blank-extension-page (2026-05-11)
Verdict
validated
Confidence: high
The root-cause story is supported by the live code, the proximate commit
exists with the claimed shape, and the backend's "innocent" verdict
matches what I see in Cloud Run + the production DB right now. The
proposed Layer 1 + Layer 2 + Layer 3 fix is correct, race-safe,
preserves the original intent of d68726b29, and exposes nothing
sensitive. Ship it. Two open follow-ups noted below — neither blocks
closing this incident.
Spot-checks performed
- Code,
apps/extension/app/containers/Toby.tsx:304— read directly. Exact text:if (isInitializing || !isDraftReady || !isUserHydrated) return null;. Matches the synthesis claim 1:1. AuthWrapper at 277-314 is exactly as the FE doctor described. Both<App>and<Onboarding2>are inside this wrapper, confirming a null return kills the entire visible UI. - Code,
apps/extension/app/state/accessors/user.tsx:45-50— read directly.getUserbody is exactly:Confirmed: noexport const getUser = () => new Promise<LoginResponse | null>((resolve) => { chrome.storage.local.get('user', ({ user }) => { resolve(user ?? null); }); });chrome.runtime.lastErrorcheck, no.catch, no timeout.UserProvideruseEffect at 71-76 is also exactly as described —getUser().then(...)with no fallback. Unbounded surface is real. - Code,
apps/extension/app/utils/chromeapi.ts:248-259— confirms the same missing pattern ingetChromeStorage, whichonboarding2DraftStorage.get()is built on. The class of bug spans beyondgetUser. - Code,
apps/extension/entrypoints/toby/index.html:62-69— preload skeleton is the static markup the user sees when#roothas no children. Matches the FE doctor's repro shape 1:1. - Git, commit
d68726b29— exists, dated2026-04-09T19:07:05 +03:00, author Jad Haidar, message "fix: gate AuthWrapper on user hydration to prevent duplicate onboarding events", touches onlyapps/extension/app/containers/Toby.tsx(+5 / -1). The diff adds the!isUserHydratedclause to the gate — exactly as the synthesis claims. - Cloud Run,
gcloud run revisions list --service=prod-api --region=us-east4— confirms the four most recent revisions reported by the backend doctor verbatim: 00427-9p2 (2026-04-01 20:08Z, ACTIVE) → 00426-s6l (2026-04-01 17:58Z) → 00425-x7f (2026-02-02 19:13Z) all on commit4b0107858; 00424-dpx (2026-01-28 20:16Z) on the preceding SHA0b44adc. The deployed SHA has not changed since 2026-02-02 — well outside the post-2026-04-09 user-complaint window. Backend cannot be the regression. - DB, Toby Prod RO — re-ran the 1-day active-user count:
41,630users active in last 24 h vs.41,578reported by the BE doctor a few hours earlier. Healthy hot path; matches diurnal expectation. - Logging spot-check —
list_log_entriesreturned a permission-denied for the validator's IAM scope (the BE doctor's account has broader log scope). I could not independently re-run the 5xx / ERROR-severity count. The Cloud Run revision evidence + healthy DB queries are strong-enough surrogates that I do not consider this a blocker for the verdict — but flagging it for transparency. Did not personally re-verify the "0 5xx in last 24h" claim; relying on BE doctor's citation plus the Cloud Run + DB cross-evidence. - Playwright synthetic repro — did not personally re-run it. The FE
doctor's
repro-blank-page.pngand the static HTML atindex.html:62-69make the visual repro a one-step trivial mapping (DOM with.preloadedBgskeleton + empty#root). The accessibility snapshot being empty is mechanically a function of having no React children — I accept this without re-running.
Correctness audit
- Does
setIsUserHydrated(true)on timeout actually unblock the gate? Yes.!isUserHydratedbecomesfalse; the gate collapses toisInitializing || !isDraftReady. If those are healthy, the AuthWrapper proceeds to render<App>or<Onboarding2>. The unblock works. - Race between timeout and legitimate
.thenresolution? Safe. WhengetUserresolves before 5s,.thenruns synchronously tosetIsUserHydrated(true)and.finally(() => clearTimeout(timeout))runs in the same microtask flush. The 5ssetTimeout(macrotask) is cancelled before it can fire. No double-fire on the happy path. - Race between timeout firing and a slow
.thenarriving after 5s? Layer 1 setsisUserHydrated=trueat 5s; if the storage callback then fires at e.g. 5.5s,.thenruns,cancelledis still false (mount lives), andsetUser(user); setIsUserHydrated(true)runs — so a legitimate but slow user record IS still applied. The in-memory user is preserved (not clobbered). The non-destructive property is the key thing that makes this fix safe to ship. cancelledflag semantics. The flag flips only on unmount. All three sites (timeout,.then,.catch) gate on it before calling state setters. Correctly prevents state updates after the component unmounts. React 18 StrictMode double-invoke is handled cleanly because each effect run has its owncancelledand its owntimeoutin closure.- Completeness gap (isInitializing). Layer 1 patches
getUserand (perApply the same shape to useOnboarding2Draft.ts:12-30) the draft path. It does NOT patch the third gate booleanisInitializing, which depends onuseIsRestoring()from the react-query persistor (IDB-backed; same hang class — flagged by both the FE doctor at finding lines 99-101 and the BE doctor at finding lines 230-235). Layer 2's 8-secondStuckRecoveryScreenis the backstop for this case, so the user is never stuck on a blank page — but they may have to tap a recovery button instead of the page self-healing at 5s. Acceptable for the close, but should be a follow-up to apply auseIsRestoring-flavoured Layer 1 patch (e.g. a local 5s timer inuseHandleRedirectFromQueryParamsthat forcessetIsLoading(false)even ifisRestoringCachedDatais still true). Listed under Recommendation, not under Objections.
Quality audit
- Preserves
d68726b29intent on the happy path. WhengetUserresolves within 5s with a returning user, the gate behaves exactly as it does today: no flash of<Onboarding2>. The fix only changes behaviour in the failure case (chrome.storage callback dropped) and in the very-slow case (>5s) — and even the >5s case applies the user once it eventually arrives, so the "duplicate onboarding events" d68726b29 was protecting against can only occur in the precise scenario where storage is slower than 5s AND the resolved user exists. That's the genuine pathological state — flashing Onboarding2 briefly is the lesser evil over the blank-page hang. - Code style. Idiomatic React-18-era effect cleanup pattern.
let cancelled+clearTimeout+ chained.then/.catch/.finallyis standard; the cleanup function correctly handles both timers and in-flight promise resolution. - Logging.
console.warnon the fallback andconsole.erroron the catch are appropriately verbose — they surface the rare-path behaviour without spamming the happy path. - Telemetry (Layer 3). Beacon fires at the Layer-2 8s site, not
at the Layer-1 5s site. That means the Layer-1 fallback (which is
the common recovery path post-fix) is invisible in Amplitude.
Consider firing a second, lower-stakes event at the Layer-1 timeout
site (e.g.
NewTabHydrationTimeout). Polish, not a blocker. - Pre-approved copy. "Your tabs are safe. Tap to recover." is
consistent with
toby/strategy/playbook.mdO1 KR1 and00-state-of-the-project.md:50. Operator does not need a copy review.
Security audit
- Does fail-open to "null user" expose sensitive data? No. The
fallback only sets the React state
isUserHydrated=true; it does NOT callsetUser(null)(which would clobber chrome.storage). The in-memoryuserstate remainsnulluntil either the slow storage callback eventually resolves OR the user signs in again through Onboarding2. Onboarding2 is fully public, pre-auth UI — there is nothing token-gated that renders before authentication. - Token / session integrity. The fix does not touch
setUser,getToken, the OAuth callback handler, or any token-bearing request path. Auth posture is unchanged. - State that should NOT re-initialise. None. The fallback is purely the boolean gate flip; user, draft, queryClient cache, and chrome.storage values are untouched.
- CSP /
chrome.runtime.id/ context invalidated. The fix responds to the context-invalidated state but does not attempt to recover from it (nochrome.runtime.idcheck, no force-reload). That's appropriate at this layer — Layer 2's "tap to recover" reload is the human escape. Whether to detectchrome.runtime.id === undefinedand auto-reload is a product question for follow-up. - No new attack surface. No new code paths handling untrusted input. No CSP changes. No JWT / cookie handling. No secrets.
- Security-review skill gate. The fix does NOT touch auth, sessions, tokens, secrets, JWT, or CSP, so the mandatory deep security-review skill does not need to spin up. The lighter review above is sufficient.
Objections (if rejected or conditional)
(None — verdict is validated.)
Recommendation
Coordinator should publish the incident doc with status: closed
and ship the FE Layer 1 + Layer 2 + Layer 3 fix as the resolution.
The synthesis is accurate, the fix is correct, race-safe, and
non-destructive, and the backend's "innocent" finding holds up under
independent re-check.
Three follow-ups to file as separate work — do NOT block closing this incident on them:
- Apply the same shape to
isInitializing(specifically theuseIsRestoring()IDB-backed path insideuseHandleRedirectFromQueryParamsatToby.tsx:168-275). Today Layer 2 catches this case after 8s with a recovery screen; ideal is a 5s Layer-1-style local timer that forcessetIsLoading(false)and lets the page self-heal without a tap. - The three SW-hardening items the BE doctor flagged
(background.ts:14
.catch; contextMenus.ts:145-191AbortController; unifiedchromeStorageGet<T>(keys, { timeoutMs })helper). These reduce the frequency of the chrome.storage callback drop. Not on the critical path for this incident. - A Layer-1 telemetry beacon at the 5s
getUserfallback site (separate from Layer 3's 8sNewTabHangShownat the recovery screen). Without it, the common recovery path is invisible in Amplitude and we can only see the worst-case 8s tail. Tiny diff.
Two operator decisions worth surfacing on the published doc:
- Whether to gate the
NewTabHangShownbeacon (and the Layer-1 hydration-timeout beacon, if added) behind a feature flag. Synthesis defaults to "on"; I concur. - Whether to redeploy prod-api as part of this incident. Per the BE doctor and my own re-check: no. The API code is innocent and a redeploy is needless blast-radius.