artifacts/toby-incident-coordinator/df069a93-28df-4439-8838-cfd953c4c974/synthesis-draft.md

Draft synthesis — blank-extension-page (2026-05-11)

Hand-authored · Last edited May 13 by initial import

Summary

Toby's new-tab extension page hangs on the static preload skeleton and never renders any React content on top of it. The symptom recurs: every Toby user is exposed each time we ship an extension update (or whenever Chrome auto-updates the extension while a new tab is open). Most users never notice because they don't reopen the new tab; those who do report a "blank page on infinite load".

Root cause

Proximate (frontend): apps/extension/app/containers/Toby.tsx:304

if (isInitializing || !isDraftReady || !isUserHydrated) return null;

returns null forever when isUserHydrated (added in commit d68726b29, 2026-04-09: "fix: gate AuthWrapper on user hydration to prevent duplicate onboarding events") never flips true. isUserHydrated is bound to a single unbounded promise in apps/extension/app/state/accessors/user.tsx:45-50:

export const getUser = () =>
  new Promise<LoginResponse | null>((resolve) => {
    chrome.storage.local.get('user', ({ user }) => {
      resolve(user ?? null);
    });
  });

No timeout. No chrome.runtime.lastError check. No .catch(). If the chrome.storage.local.get callback never fires, the promise hangs forever and AuthWrapper returns null forever.

Distal (platform): Chrome's "extension context invalidated" state (renderer-side) drops chrome.storage.local.get callbacks. This state is entered when Chrome auto-updates the extension while a chrome_url_override new-tab is open (which is every Toby user, every release), when the user manually disables/re-enables the extension, or when the service worker (SW) crashes during a critical handshake phase. This is a Chromium MV3 platform behaviour, not a Toby code regression.

Why now (post-2026-04-09): the underlying chrome.storage drop is an evergreen Chrome MV3 phenomenon — the extension used to accidentally tolerate it because the AuthWrapper gate only depended on isInitializing || !isDraftReady. Commit d68726b29 added !isUserHydrated to the gate, binding the rendered UI 1:1 to that unbounded callback. The widened gate is what turned a tolerable platform quirk into a reliable user-visible hang.
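
The effect of the widened gate can be modelled as two predicates. This is a minimal sketch, not code from Toby.tsx: the names GateState, blockedBefore, blockedAfter, and droppedCallback are hypothetical, and it assumes the other two flags settle normally when the storage callback is dropped.

```typescript
// Hypothetical model of the AuthWrapper gate; the real check lives in Toby.tsx.
interface GateState {
  isInitializing: boolean;
  isDraftReady: boolean;
  isUserHydrated: boolean;
}

// Gate before d68726b29: user hydration was not consulted.
const blockedBefore = (s: GateState): boolean =>
  s.isInitializing || !s.isDraftReady;

// Gate after d68726b29: rendering also waits on isUserHydrated.
const blockedAfter = (s: GateState): boolean =>
  s.isInitializing || !s.isDraftReady || !s.isUserHydrated;

// Assumed state when the chrome.storage.local.get callback is dropped:
// everything else is ready, but isUserHydrated never flips to true.
const droppedCallback: GateState = {
  isInitializing: false,
  isDraftReady: true,
  isUserHydrated: false,
};

console.log(blockedBefore(droppedCallback)); // false: the old gate lets the UI render
console.log(blockedAfter(droppedCallback)); // true: the new gate keeps returning null
```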

What this is NOT

The earlier toby-product-strategist hypothesis (artifact 388c1db4) that this was an MV3 service-worker boot regression is refuted:

  • Prod-api SHA hasn't changed since 2026-02-02 (commit-sha=4b0107858, three consecutive Cloud Run revisions on the same SHA; the 2026-04-01 deploys are config-only redeploys).
  • 5xx volume on prod-api last 24 h: 0. Worst day this week: 19 / 1.18M = 0.0016%.
  • 23 ERROR-severity log entries in last 7 days, all expected 401s on stale-session endpoints. No panics. No fatals.
  • DB healthy: 41,578 DAU, 720 new signups / 7 d, healthy diurnal curve.
  • SW boot path is structurally clean: every chrome.*.addListener registers synchronously at module top level. No listener-after-await MV3 boot bug.
  • getUser() does NOT hit the network — the hang is pre-HTTP, so an API regression cannot be the cause.

Proposed fix (frontend, defence-in-depth)

Layer 1 — bound the hydration promises with a 5s timeout that fails open

apps/extension/app/state/accessors/user.tsx around line 71 (the useEffect that calls getUser()):

useEffect(() => {
  let cancelled = false;
  const timeout = setTimeout(() => {
    if (!cancelled) {
      console.warn('[toby] getUser() exceeded 5s; falling back to null user.');
      setIsUserHydrated(true);
    }
  }, 5000);

  getUser()
    .then((user) => {
      if (cancelled) return;
      if (user) setUser(user);
      setIsUserHydrated(true);
    })
    .catch((err) => {
      console.error('[toby] getUser() failed:', err);
      if (!cancelled) setIsUserHydrated(true);
    })
    .finally(() => clearTimeout(timeout));

  return () => {
    cancelled = true;
    clearTimeout(timeout);
  };
}, []);

Apply the same shape to apps/extension/app/hooks/useOnboarding2Draft.ts:12-30 for isDraftReady.
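
Since the same shape is needed in two hooks, it could be factored into one small helper. The sketch below is an assumption rather than existing Toby code: settleWithin and its parameters are hypothetical names. It resolves with a fallback value when the source promise rejects or fails to settle within timeoutMs, and it never rejects. One deliberate difference from the effect above: a result that arrives after the timeout is ignored instead of being applied late.

```typescript
// Hypothetical helper, not part of the Toby codebase.
// Resolves with the source value, or with `fallback` if the source
// rejects or does not settle within `timeoutMs`. Never rejects.
function settleWithin<T>(
  source: Promise<T>,
  timeoutMs: number,
  fallback: T,
): Promise<{ value: T; timedOut: boolean }> {
  return new Promise((resolve) => {
    let settled = false;

    // First caller wins; later calls (late results, late timers) are no-ops.
    const finish = (value: T, timedOut: boolean): void => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      resolve({ value, timedOut });
    };

    const timer = setTimeout(() => finish(fallback, true), timeoutMs);

    source.then(
      (value) => finish(value, false),
      () => finish(fallback, false), // a rejection also fails open
    );
  });
}
```

Inside a hook this might be used as settleWithin(getUser(), 5000, null), with the caller logging a warning when timedOut is true and then setting the hydration flag in either case.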

Layer 2 — replace return null with a visible escape hatch after 8s

apps/extension/app/containers/Toby.tsx:304:

const [showStuckEscapeHatch, setShowStuckEscapeHatch] = useState(false);

useEffect(() => {
  if (!isInitializing && isDraftReady && isUserHydrated) return;
  const t = setTimeout(() => setShowStuckEscapeHatch(true), 8000);
  return () => clearTimeout(t);
}, [isInitializing, isDraftReady, isUserHydrated]);

if (isInitializing || !isDraftReady || !isUserHydrated) {
  if (showStuckEscapeHatch) {
    return <StuckRecoveryScreen onRetry={() => window.location.reload()} />;
  }
  return null;
}

Copy: "Your tabs are safe. Tap to recover." — already pre-approved per toby/00-state-of-the-project.md:50 and toby/strategy/playbook.md O1 KR1.

Layer 3 — telemetry beacon

At the setShowStuckEscapeHatch(true) site, fire trackEvent('NewTabHangShown', { isInitializing, isDraftReady, isUserHydrated, browser, version }). This gives us a first-party signal for the hang, filling the gap between user complaints in Chrome Web Store (CWS) reviews and our existing Sentry/Amplitude reporting.
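
One way to wire the beacon is a small reporter that fires at most once per page load and never lets a telemetry failure break the recovery screen. This is a sketch under assumptions: the TrackEvent type is a guess at the shape of the existing trackEvent function, and createHangReporter and GateSnapshot are hypothetical names.

```typescript
// Assumed shape of the existing trackEvent; the real signature may differ.
type TrackEvent = (name: string, props: Record<string, unknown>) => void;

interface GateSnapshot {
  isInitializing: boolean;
  isDraftReady: boolean;
  isUserHydrated: boolean;
}

// Hypothetical factory: returns a reporter that sends NewTabHangShown
// at most once. The reporter returns true only when it actually fired.
function createHangReporter(
  trackEvent: TrackEvent,
  env: { browser: string; version: string },
): (gate: GateSnapshot) => boolean {
  let sent = false;
  return (gate: GateSnapshot): boolean => {
    if (sent) return false;
    sent = true;
    try {
      trackEvent('NewTabHangShown', { ...gate, ...env });
    } catch {
      // Telemetry must never prevent the recovery screen from rendering.
    }
    return true;
  };
}
```

The once-only guard is a design choice: the 8s timer in Layer 2 restarts whenever one of the gate flags changes, so without a guard a single stuck page load could emit several events and inflate the baseline.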

Backend hardening (follow-up, NOT required to close incident)

The Go API itself does not need a change. But the extension service worker has three unrelated fragility issues that, while they don't cause this bug, do make its underlying platform conditions more frequent. File as follow-ups, ship outside this incident:

  1. Catch the persist-restore rejection in apps/extension/entrypoints/background.ts:14. Currently fire-and-forget; an IDB failure is silently swallowed.
  2. AbortController on SW fetch calls in apps/extension/app/background/contextMenus.ts:145-191 (10s timeout). Currently a stuck TCP socket can keep the SW alive past its idle window.
  3. Unified chromeStorageGet<T>(keys, { timeoutMs }) helper that wraps chrome.runtime.lastError checks + chrome.runtime.id validity + a timeout. Replace every raw chrome.storage.local.get(key, cb) callsite with this. The FE Layer 1 fix only patches the one getUser site; this helper would fix the class.
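
A minimal sketch of what the unified helper in item 3 might look like. It is an illustration, not the planned implementation. Two things here are assumptions: the extra first parameter (api), added so the sketch runs without a real extension context, and the ChromeLike interface, a structural stand-in for the parts of the chrome global it touches. In the extension the real chrome object would presumably be passed in or closed over.

```typescript
// Structural stand-in for the parts of `chrome` used below (assumption).
interface ChromeLike {
  runtime: { id?: string; lastError?: { message?: string } };
  storage: {
    local: {
      get(keys: string | string[], cb: (items: Record<string, unknown>) => void): void;
    };
  };
}

// Hypothetical helper: rejects on an invalid context, on lastError,
// or when the callback has not fired within timeoutMs.
function chromeStorageGet<T extends Record<string, unknown>>(
  api: ChromeLike,
  keys: string | string[],
  { timeoutMs }: { timeoutMs: number },
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    // An invalidated extension context reports no runtime id.
    if (!api.runtime.id) {
      reject(new Error('extension context invalidated'));
      return;
    }

    const timer = setTimeout(
      () => reject(new Error(`chrome.storage.local.get timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );

    try {
      api.storage.local.get(keys, (items) => {
        clearTimeout(timer);
        const err = api.runtime.lastError;
        if (err) reject(new Error(err.message ?? 'chrome.storage.local.get failed'));
        else resolve(items as T);
      });
    } catch (e) {
      // The call itself may throw if the context is already gone.
      clearTimeout(timer);
      reject(e);
    }
  });
}
```

Unlike the raw callback form, every failure mode here ends in a rejection, so callers can fail open in a single .catch() instead of waiting on a callback that may never arrive.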

Verify plan

  1. Manual repro (canonical scenario for chrome.storage drop):

    1. cd apps/extension && pnpm install && pnpm dev
    2. Load unpacked at apps/extension/.output/chrome-mv3 via chrome://extensions.
    3. Open the new tab; confirm happy path renders.
    4. Toggle the extension off and back on in chrome://extensions (this puts the open tab into the "context invalidated" state — chrome.runtime.id === undefined).
    5. Reload the new tab. Pre-fix: blank skeleton forever. Post-fix: Onboarding2 (or App) renders after 5s, with the [toby] getUser() exceeded 5s console warning.
  2. Recovery-screen repro (Layer 2):

    • In DevTools, monkey-patch chrome.storage.local.get = () => {} before reloading the new-tab page. Pre-Layer-2: blank. Post-Layer-2: StuckRecoveryScreen renders after 8s with the "tap to recover" CTA.
  3. Regression check (the d68726b29 bug must stay fixed): When isUserHydrated legitimately resolves with a pre-existing user before the 5s timeout, AuthWrapper must behave exactly as today — no flash of <Onboarding2> for returning users.

  4. Telemetry sanity (Layer 3): Confirm NewTabHangShown events flow into Amplitude. Establish baseline frequency in the first 7 days. If volume is non-trivial without a correlated prod-api 5xx spike, the platform-side chrome.storage drop hypothesis is confirmed.

  5. Backend monitoring (no action expected): Continue watching prod-api 5xx; expect to stay at ~0. If a 5xx spike correlates with a new NewTabHangShown spike, re-open backend investigation. With current cadence (one prod-api code change in 4 months) this is unlikely.

Open questions / operator decisions

  • Defer SW hardening? The three backend-flagged hardening items are not required to close this incident. Ship FE Layer 1+2+3 first; queue SW hardening as a separate piece of work.
  • Do we want the NewTabHangShown beacon gated on a feature flag? Default is to ship it on.

Citations

  • Frontend finding: artifacts/toby-frontend-doctor/6e2b3eb9-36bf-42d3-8de3-5afa48f4b167/finding.md (run id 6e2b3eb9-36bf-42d3-8de3-5afa48f4b167).
  • Backend finding: artifacts/toby-backend-doctor/083ec6d2-63e9-4c3e-b55e-a95301a4aa72/finding.md (run id 083ec6d2-63e9-4c3e-b55e-a95301a4aa72).
  • Frontend Playwright repro screenshot: repro-blank-page.png, in the same folder as the frontend finding.
  • Proximate code site: apps/extension/app/containers/Toby.tsx:304.
  • Proximate hydration site: apps/extension/app/state/accessors/user.tsx:45-50, 66-99.
  • Proximate commit: d68726b29 (2026-04-09).
  • Prior strategist hypothesis (REFUTED): artifact 388c1db4-59b7-49e9-8ec3-ecfba972c95f.