A
AIOS Wiki
read-only · public mirror
Open AIOS
Wikiartifactstoby-incident-validatora28a3690-38d7-4ce9-a9c2-c6d436da1793artifacts/toby-incident-validator/a28a3690-38d7-4ce9-a9c2-c6d436da1793/validation.md

Validation — blank-extension-page (2026-05-11)

Hand-authored·8 min read·7 sections·Last edited May 13 by initial import·View history

Verdict

validated

Confidence: high

The root-cause story is supported by the live code, the proximate commit exists with the claimed shape, and the backend's "innocent" verdict matches what I see in Cloud Run + the production DB right now. The proposed Layer 1 + Layer 2 + Layer 3 fix is correct, race-safe, preserves the original intent of d68726b29, and exposes nothing sensitive. Ship it. Two open follow-ups noted below — neither blocks closing this incident.

Spot-checks performed

  • Code, apps/extension/app/containers/Toby.tsx:304 — read directly. Exact text: if (isInitializing || !isDraftReady || !isUserHydrated) return null;. Matches the synthesis claim 1:1. AuthWrapper at 277-314 is exactly as the FE doctor described. Both <App> and <Onboarding2> are inside this wrapper, confirming a null return kills the entire visible UI.
  • Code, apps/extension/app/state/accessors/user.tsx:45-50 — read directly. getUser body is exactly:
    export const getUser = () =>
      new Promise<LoginResponse | null>((resolve) => {
        chrome.storage.local.get('user', ({ user }) => {
          resolve(user ?? null);
        });
      });
    
    Confirmed: no chrome.runtime.lastError check, no .catch, no timeout. UserProvider useEffect at 71-76 is also exactly as described — getUser().then(...) with no fallback. Unbounded surface is real.
  • Code, apps/extension/app/utils/chromeapi.ts:248-259 — confirms the same missing pattern in getChromeStorage, which onboarding2DraftStorage.get() is built on. The class of bug spans beyond getUser.
  • Code, apps/extension/entrypoints/toby/index.html:62-69 — preload skeleton is the static markup the user sees when #root has no children. Matches the FE doctor's repro shape 1:1.
  • Git, commit d68726b29 — exists, dated 2026-04-09T19:07:05 +03:00, author Jad Haidar, message "fix: gate AuthWrapper on user hydration to prevent duplicate onboarding events", touches only apps/extension/app/containers/Toby.tsx (+5 / -1). The diff adds the !isUserHydrated clause to the gate — exactly as the synthesis claims.
  • Cloud Run, gcloud run revisions list --service=prod-api --region=us-east4 — confirms the four most recent revisions reported by the backend doctor verbatim: 00427-9p2 (2026-04-01 20:08Z, ACTIVE) → 00426-s6l (2026-04-01 17:58Z) → 00425-x7f (2026-02-02 19:13Z) all on commit 4b0107858; 00424-dpx (2026-01-28 20:16Z) on the preceding SHA 0b44adc. The deployed SHA has not changed since 2026-02-02 — well outside the post-2026-04-09 user-complaint window. Backend cannot be the regression.
  • DB, Toby Prod RO — re-ran the 1-day active-user count: 41,630 users active in last 24 h vs. 41,578 reported by the BE doctor a few hours earlier. Healthy hot path; matches diurnal expectation.
  • Logging spot-checklist_log_entries returned a permission-denied for the validator's IAM scope (the BE doctor's account has broader log scope). I could not independently re-run the 5xx / ERROR-severity count. The Cloud Run revision evidence + healthy DB queries are strong-enough surrogates that I do not consider this a blocker for the verdict — but flagging it for transparency. Did not personally re-verify the "0 5xx in last 24h" claim; relying on BE doctor's citation plus the Cloud Run + DB cross-evidence.
  • Playwright synthetic repro — did not personally re-run it. The FE doctor's repro-blank-page.png and the static HTML at index.html:62-69 make the visual repro a one-step trivial mapping (DOM with .preloadedBg skeleton + empty #root). The accessibility snapshot being empty is mechanically a function of having no React children — I accept this without re-running.

Correctness audit

  • Does setIsUserHydrated(true) on timeout actually unblock the gate? Yes. !isUserHydrated becomes false; the gate collapses to isInitializing || !isDraftReady. If those are healthy, the AuthWrapper proceeds to render <App> or <Onboarding2>. The unblock works.
  • Race between timeout and legitimate .then resolution? Safe. When getUser resolves before 5s, .then runs synchronously to setIsUserHydrated(true) and .finally(() => clearTimeout(timeout)) runs in the same microtask flush. The 5s setTimeout (macrotask) is cancelled before it can fire. No double-fire on the happy path.
  • Race between timeout firing and a slow .then arriving after 5s? Layer 1 sets isUserHydrated=true at 5s; if the storage callback then fires at e.g. 5.5s, .then runs, cancelled is still false (mount lives), and setUser(user); setIsUserHydrated(true) runs — so a legitimate but slow user record IS still applied. The in-memory user is preserved (not clobbered). The non-destructive property is the key thing that makes this fix safe to ship.
  • cancelled flag semantics. The flag flips only on unmount. All three sites (timeout, .then, .catch) gate on it before calling state setters. Correctly prevents state updates after the component unmounts. React 18 StrictMode double-invoke is handled cleanly because each effect run has its own cancelled and its own timeout in closure.
  • Completeness gap (isInitializing). Layer 1 patches getUser and (per Apply the same shape to useOnboarding2Draft.ts:12-30) the draft path. It does NOT patch the third gate boolean isInitializing, which depends on useIsRestoring() from the react-query persistor (IDB-backed; same hang class — flagged by both the FE doctor at finding lines 99-101 and the BE doctor at finding lines 230-235). Layer 2's 8-second StuckRecoveryScreen is the backstop for this case, so the user is never stuck on a blank page — but they may have to tap a recovery button instead of the page self-healing at 5s. Acceptable for the close, but should be a follow-up to apply a useIsRestoring-flavoured Layer 1 patch (e.g. a local 5s timer in useHandleRedirectFromQueryParams that forces setIsLoading(false) even if isRestoringCachedData is still true). Listed under Recommendation, not under Objections.

Quality audit

  • Preserves d68726b29 intent on the happy path. When getUser resolves within 5s with a returning user, the gate behaves exactly as it does today: no flash of <Onboarding2>. The fix only changes behaviour in the failure case (chrome.storage callback dropped) and in the very-slow case (>5s) — and even the >5s case applies the user once it eventually arrives, so the "duplicate onboarding events" d68726b29 was protecting against can only occur in the precise scenario where storage is slower than 5s AND the resolved user exists. That's the genuine pathological state — flashing Onboarding2 briefly is the lesser evil over the blank-page hang.
  • Code style. Idiomatic React-18-era effect cleanup pattern. let cancelled + clearTimeout + chained .then/.catch/.finally is standard; the cleanup function correctly handles both timers and in-flight promise resolution.
  • Logging. console.warn on the fallback and console.error on the catch are appropriately verbose — they surface the rare-path behaviour without spamming the happy path.
  • Telemetry (Layer 3). Beacon fires at the Layer-2 8s site, not at the Layer-1 5s site. That means the Layer-1 fallback (which is the common recovery path post-fix) is invisible in Amplitude. Consider firing a second, lower-stakes event at the Layer-1 timeout site (e.g. NewTabHydrationTimeout). Polish, not a blocker.
  • Pre-approved copy. "Your tabs are safe. Tap to recover." is consistent with toby/strategy/playbook.md O1 KR1 and 00-state-of-the-project.md:50. Operator does not need a copy review.

Security audit

  • Does fail-open to "null user" expose sensitive data? No. The fallback only sets the React state isUserHydrated=true; it does NOT call setUser(null) (which would clobber chrome.storage). The in-memory user state remains null until either the slow storage callback eventually resolves OR the user signs in again through Onboarding2. Onboarding2 is fully public, pre-auth UI — there is nothing token-gated that renders before authentication.
  • Token / session integrity. The fix does not touch setUser, getToken, the OAuth callback handler, or any token-bearing request path. Auth posture is unchanged.
  • State that should NOT re-initialise. None. The fallback is purely the boolean gate flip; user, draft, queryClient cache, and chrome.storage values are untouched.
  • CSP / chrome.runtime.id / context invalidated. The fix responds to the context-invalidated state but does not attempt to recover from it (no chrome.runtime.id check, no force-reload). That's appropriate at this layer — Layer 2's "tap to recover" reload is the human escape. Whether to detect chrome.runtime.id === undefined and auto-reload is a product question for follow-up.
  • No new attack surface. No new code paths handling untrusted input. No CSP changes. No JWT / cookie handling. No secrets.
  • Security-review skill gate. The fix does NOT touch auth, sessions, tokens, secrets, JWT, or CSP, so the mandatory deep security-review skill does not need to spin up. The lighter review above is sufficient.

Objections (if rejected or conditional)

(None — verdict is validated.)

Recommendation

Coordinator should publish the incident doc with status: closed and ship the FE Layer 1 + Layer 2 + Layer 3 fix as the resolution. The synthesis is accurate, the fix is correct, race-safe, and non-destructive, and the backend's "innocent" finding holds up under independent re-check.

Three follow-ups to file as separate work — do NOT block closing this incident on them:

  1. Apply the same shape to isInitializing (specifically the useIsRestoring() IDB-backed path inside useHandleRedirectFromQueryParams at Toby.tsx:168-275). Today Layer 2 catches this case after 8s with a recovery screen; ideal is a 5s Layer-1-style local timer that forces setIsLoading(false) and lets the page self-heal without a tap.
  2. The three SW-hardening items the BE doctor flagged (background.ts:14 .catch; contextMenus.ts:145-191 AbortController; unified chromeStorageGet<T>(keys, { timeoutMs }) helper). These reduce the frequency of the chrome.storage callback drop. Not on the critical path for this incident.
  3. A Layer-1 telemetry beacon at the 5s getUser fallback site (separate from Layer 3's 8s NewTabHangShown at the recovery screen). Without it, the common recovery path is invisible in Amplitude and we can only see the worst-case 8s tail. Tiny diff.

Two operator decisions worth surfacing on the published doc:

  • Whether to gate the NewTabHangShown beacon (and the Layer-1 hydration-timeout beacon, if added) behind a feature flag. Synthesis defaults to "on"; I concur.
  • Whether to redeploy prod-api as part of this incident. Per the BE doctor and my own re-check: no. The API code is innocent and a redeploy is needless blast-radius.