A
AIOS Wiki
read-only · public mirror
Open AIOS
Wikitobyincidentstoby/incidents/2026-05-11-blank-extension-page.md

Blank extension page — infinite-load hang on new tab

Hand-authored·9 min read·18 sections·Last edited May 13 by agent (MCP)·View history

TL;DR

The Toby new-tab page renders the static preload skeleton and never transitions to the real UI. AuthWrapper at apps/extension/app/containers/Toby.tsx:304 returns null forever because isUserHydrated (added 2026-04-09 in commit d68726b29) never flips. The proximate cause is that getUser() at apps/extension/app/state/accessors/user.tsx:45-50 has no timeout / no chrome.runtime.lastError check / no .catch; when Chrome's "extension context invalidated" state drops the chrome.storage.local.get callback (which happens on every extension auto-update), the promise hangs forever.

Backend (Go API) is innocent. Prod-api SHA hasn't changed since 2026-02-02; 0 5xx in last 24h; SW boot path is structurally clean. The earlier toby-product-strategist MV3-SW-boot-regression hypothesis (388c1db4) is refuted.

Fix (frontend only, defence-in-depth): bound getUser() and getOnboarding2Draft with 5s timeouts that fail open; replace return null with a visible recovery screen after 8s; instrument with a NewTabHangShown beacon.

Status: closed → shipped (PR https://github.com/axiomzen/toby-mono-repo/pull/12, 2026-05-13).

Symptom

  • New-tab page loads, shows the static .preloadedBg skeleton (light-grey sidebar columns, circle avatar, center card-grid background).
  • Never transitions. #root exists in the DOM but has zero children.
  • Accessibility tree is empty; users see only decorative CSS.
  • Console is silent. No JS error, no failed network call. That's why users describe it as "infinite loading", not "crash".
  • Recurring across Chrome / Vivaldi 7.0 / Orion based on CWS reviews and forum reports.

Root cause

Proximate (frontend)

apps/extension/app/containers/Toby.tsx:304:

if (isInitializing || !isDraftReady || !isUserHydrated) return null;

returns null forever when any of the three booleans never flips true. <App> and <Onboarding2> are both inside this wrapper, so a null return kills the entire visible UI.

The third boolean — isUserHydrated — was added to this gate in commit d68726b29 (Jad Haidar, 2026-04-09 19:07 +03:00, "fix: gate AuthWrapper on user hydration to prevent duplicate onboarding events"). That fix solves a real bug (returning users briefly seeing the onboarding flow), but it widened the failure surface without bounding the new dependency.

isUserHydrated is bound 1:1 to a single unbounded promise at apps/extension/app/state/accessors/user.tsx:45-50:

export const getUser = () =>
  new Promise<LoginResponse | null>((resolve) => {
    chrome.storage.local.get('user', ({ user }) => {
      resolve(user ?? null);
    });
  });

No timeout. No chrome.runtime.lastError check. No .catch(). The same defect exists in apps/extension/app/utils/chromeapi.ts:248-259 (which useOnboarding2Draft is built on), so isDraftReady is exposed to the same class of bug. isInitializing waits on useIsRestoring() from the react-query persistor (IDB-backed), with a parallel silent-hang surface.

Distal (Chrome MV3 platform)

Chrome's "extension context invalidated" state (renderer-side) drops chrome.storage.local.get callbacks. This state is entered when:

  • Chrome auto-updates the extension while a chrome_url_override new-tab is open (this happens to every Toby user, every release — Toby owns the new-tab page).
  • The user manually disables/re-enables or reloads the extension in chrome://extensions.
  • The SW crashes during a critical handshake phase (rarer).

This is a Chromium MV3 platform behaviour, not a Toby code regression.

Why now (post-2026-04-09)

The underlying chrome.storage callback drop is evergreen. The extension used to accidentally tolerate it because the AuthWrapper gate only depended on isInitializing || !isDraftReady. Commit d68726b29 added !isUserHydrated, binding the rendered UI 1:1 to that one unbounded callback. The widened gate is what turned a tolerable platform quirk into a reliable user-visible hang.

What this is NOT

The earlier toby-product-strategist hypothesis (artifact 388c1db4-59b7-49e9-8ec3-ecfba972c95f) that this was an MV3 service-worker boot regression is refuted by independent backend evidence (validator re-checked):

ProbeEvidence
Prod-api SHA stability4b0107858e706c904e6cf2841fbcbf81a1e2f94f has been the active SHA on three consecutive Cloud Run revisions (00425, 00426, 00427) since 2026-02-02 — well before the post-2026-04-09 user-complaint window. The 2026-04-01 deploys are config-only redeploys.
5xx volume0 in last 24h. Worst day this week: 19 / 1.18M = 0.0016%.
ERROR severity logs23 entries in last 7 days; 22 are expected 401s on stale-session endpoints, 1 is a downstream toby-ai-api 500 not on the auth path. No panics. No fatals.
DB health41,578 DAU, 720 new signups / 7d, healthy diurnal curve, peak 3,352 active-this-hour.
SW boot pathEvery chrome.*.addListener registers synchronously at module top level. No listener-after-await MV3 boot bug.
Network in hang pathgetUser() does not make a network call — the hang is pre-HTTP. An API regression structurally cannot cause this.

Fix

Layer 1 — bound the hydration promises with a 5s timeout (fail open)

apps/extension/app/state/accessors/user.tsx, around line 71 (the useEffect that calls getUser):

useEffect(() => {
  let cancelled = false;
  const timeout = setTimeout(() => {
    if (!cancelled) {
      console.warn('[toby] getUser() exceeded 5s; falling back to null user.');
      setIsUserHydrated(true);
    }
  }, 5000);

  getUser()
    .then((user) => {
      if (cancelled) return;
      if (user) setUser(user);
      setIsUserHydrated(true);
    })
    .catch((err) => {
      console.error('[toby] getUser() failed:', err);
      if (!cancelled) setIsUserHydrated(true);
    })
    .finally(() => clearTimeout(timeout));

  return () => {
    cancelled = true;
    clearTimeout(timeout);
  };
}, []);

Apply the same shape to apps/extension/app/hooks/useOnboarding2Draft.ts:12-30 for isDraftReady.

Validator confirmed:

  • Race-safe. On the happy path, .finally(clearTimeout) runs in the microtask flush before the 5s macrotask can fire.
  • Non-destructive. When the storage callback arrives slow (>5s), .then still applies the user record once it eventually resolves — the in-memory user is not clobbered.
  • Preserves d68726b29 intent. Returning users with healthy storage never see an Onboarding2 flash.

Layer 2 — visible recovery screen after 8s

apps/extension/app/containers/Toby.tsx:304:

const [showStuckEscapeHatch, setShowStuckEscapeHatch] = useState(false);

useEffect(() => {
  if (!isInitializing && isDraftReady && isUserHydrated) return;
  const t = setTimeout(() => setShowStuckEscapeHatch(true), 8000);
  return () => clearTimeout(t);
}, [isInitializing, isDraftReady, isUserHydrated]);

if (isInitializing || !isDraftReady || !isUserHydrated) {
  if (showStuckEscapeHatch) {
    return <StuckRecoveryScreen onRetry={() => window.location.reload()} />;
  }
  return null;
}

Copy: "Your tabs are safe. Tap to recover." — pre-approved per toby/00-state-of-the-project.md:50 and toby/strategy/playbook.md O1 KR1.

Layer 3 — telemetry beacon

At the setShowStuckEscapeHatch(true) site:

trackEvent('NewTabHangShown', {
  isInitializing,
  isDraftReady,
  isUserHydrated,
  browser,
  version,
});

Establishes the first signal we have between "user complains in CWS review" and the existing Sentry / Amplitude funnels.

Verify plan

  1. Manual repro (canonical scenario):

    1. cd apps/extension && pnpm install && pnpm dev
    2. Load unpacked at apps/extension/.output/chrome-mv3 via chrome://extensions.
    3. Open the new tab; confirm happy path renders.
    4. Toggle the extension off and back on in chrome://extensions. The open tab now has chrome.runtime.id === undefined (the canonical context-invalidated state).
    5. Reload the new tab. Pre-fix: blank skeleton forever. Post-fix: Onboarding2 (or App) renders after 5s with [toby] getUser() exceeded 5s in the console.
  2. Recovery-screen repro (Layer 2): In DevTools, before reload: chrome.storage.local.get = () => {}. Reload. Pre-Layer-2: blank. Post-Layer-2: StuckRecoveryScreen renders at 8s with the "tap to recover" CTA.

  3. Regression check (d68726b29 must remain fixed): When isUserHydrated legitimately resolves with a returning user before 5s, AuthWrapper must behave exactly as today — no flash of <Onboarding2>.

  4. Telemetry sanity: Confirm NewTabHangShown flows into Amplitude. Establish baseline frequency in the first 7 days. If volume is non-trivial without a correlated prod-api 5xx spike, the platform-side chrome.storage drop hypothesis is confirmed in prod.

  5. Backend monitoring (no action expected): Watch prod-api 5xx; expect to stay at ~0. If a 5xx spike correlates with a NewTabHangShown spike, re-open backend investigation. With current cadence (one prod-api code change in 4 months) this is unlikely.

Operator decisions to surface

  • Should NewTabHangShown (and the optional Layer-1 hydration-timeout beacon below) be gated behind a feature flag? Default proposal is on. Validator concurs.
  • Should we redeploy prod-api as part of this incident? Both backend doctor and validator: no. The API code is innocent; a redeploy is needless blast-radius.

Follow-ups (NOT blockers for closing this incident)

  1. Apply Layer-1 shape to isInitializing. Specifically the useIsRestoring() IDB-backed path inside useHandleRedirectFromQueryParams at Toby.tsx:168-275. Today Layer 2 catches this case at 8s with a recovery screen; the ideal is a 5s Layer-1-style local timer that lets the page self-heal without a tap.
  2. SW hardening (three items flagged by the backend doctor):
    • apps/extension/entrypoints/background.ts:14 — chain .catch(err => console.error('[toby-sw] persistQueryClientRestore failed', err)) on the persist-restore call. Currently fire-and-forget; IDB failures are silently swallowed.
    • apps/extension/app/background/contextMenus.ts:145-191 — wrap the SW fetch calls with an AbortController + 10s timeout. Currently a stuck TCP socket can keep the SW alive past its idle window.
    • Build a unified chromeStorageGet<T>(keys, { timeoutMs }) helper that wraps chrome.runtime.lastError + chrome.runtime.id validity + a timeout. Replace every raw chrome.storage.local.get(key, cb) callsite with this. Layer 1 only patches the getUser and getOnboarding2Draft sites; this helper would fix the class.
  3. Layer-1 telemetry beacon. Fire a second, lower-stakes event (e.g. NewTabHydrationTimeout) at the Layer-1 5s fallback site. Without it, the common post-fix recovery path is invisible in Amplitude — we'd only see the 8s worst-case tail.

Open questions

None blocking. Operator decisions above are explicit choices, not unknowns.

Citations

  • Frontend finding: artifacts/toby-frontend-doctor/6e2b3eb9-36bf-42d3-8de3-5afa48f4b167/finding.md (run id 6e2b3eb9-36bf-42d3-8de3-5afa48f4b167; Playwright screenshot + synthetic HTML alongside).
  • Backend finding: artifacts/toby-backend-doctor/083ec6d2-63e9-4c3e-b55e-a95301a4aa72/finding.md (run id 083ec6d2-63e9-4c3e-b55e-a95301a4aa72).
  • Validation: artifacts/toby-incident-validator/a28a3690-38d7-4ce9-a9c2-c6d436da1793/validation.md (run id a28a3690-38d7-4ce9-a9c2-c6d436da1793).
  • Synthesis draft (preserved): artifacts/toby-incident-coordinator/df069a93-28df-4439-8838-cfd953c4c974/synthesis-draft.md.
  • Ship result: artifacts/toby-incident-fix-shipper/b3400d87-0830-4f89-bb70-4c3907c085f1/ship-result.md (run id b3400d87-0830-4f89-bb70-4c3907c085f1).
  • Proximate code site: apps/extension/app/containers/Toby.tsx:304 and apps/extension/app/state/accessors/user.tsx:45-50,66-99.
  • Class-of-bug code site: apps/extension/app/utils/chromeapi.ts:248-259 (same missing-lastError/timeout pattern in getChromeStorage).
  • Proximate commit: d68726b29 (Jad Haidar, 2026-04-09, +5/-1 to Toby.tsx).
  • Prior strategist hypothesis (REFUTED): artifact 388c1db4-59b7-49e9-8ec3-ecfba972c95f.

Timeline

Time (UTC)Event
2026-04-09 16:07Commit d68726b29 lands. AuthWrapper gate widened to depend on !isUserHydrated.
post-2026-04-09User complaints about "blank page on infinite load" begin (CWS reviews, Vivaldi 7.0 / Orion forums).
2026-05-11 17:08Incident dispatched to warroom.
2026-05-11 17:14Frontend doctor reports proximate cause + defers to backend.
2026-05-11 17:30Backend doctor refutes SW-boot-regression hypothesis with prod-api / DB / SW-boot evidence.
2026-05-11 17:36Validator confirms synthesis with validated / high confidence.
2026-05-11 17:38Incident doc published. Status: closed, fix queued for implementation by the operator.
2026-05-13 04:59Bridge files TOBY-14 ("Ship the blank-extension-page reliability hotfix") into the warroom inbox.
2026-05-13 05:08Fix-shipper opens PR https://github.com/axiomzen/toby-mono-repo/pull/12. Status: shipped.

PR shipped

  • PR URL: https://github.com/axiomzen/toby-mono-repo/pull/12
  • Branch: warroom/2026-05-11-blank-extension-page-toby-14
  • Commit: 06baf0f8a (base 75a09e34d on origin/main)
  • Source ticket: TOBY-14
  • Shipper run: b3400d87-0830-4f89-bb70-4c3907c085f1 (artifact: artifacts/toby-incident-fix-shipper/b3400d87-0830-4f89-bb70-4c3907c085f1/ship-result.md)
  • Files touched:
    • apps/extension/app/state/accessors/user.tsx (Layer 1 — 5s timeout fail-open on getUser())
    • apps/extension/app/hooks/useOnboarding2Draft.ts (Layer 1 — same shape on isReady)
    • apps/extension/app/containers/Toby.tsx (Layer 2 + Layer 3 — StuckRecoveryScreen at 8s + NewTabHangShown beacon)
    • apps/extension/app/components/StuckRecoveryScreen.tsx (new component, copy "Your tabs are safe. Tap to recover.")
  • What was deliberately NOT included (per the "Follow-ups" section): Layer-1 shape for isInitializing, the SW-hardening trio, the Layer-1 telemetry beacon. Those remain queued as separate work.
  • Verify plan: the doc's "Verify plan" section above is the canonical checklist for CI + reviewer manual repro. No local typecheck/lint ran in the ephemeral worktree (no node_modules); the diff is minimal and additive, relying on CI for the full check.