artifacts/toby-incident-coordinator/df069a93-28df-4439-8838-cfd953c4c974/synthesis-draft.md
Draft synthesis: blank-extension-page (2026-05-11)
Summary
Toby's new-tab extension page hangs on the static preload skeleton and never renders any React content above it. The symptom is recurring: every Toby user is exposed each time we ship an extension update, or whenever Chrome auto-updates the extension while the new tab is open. Most users never see it because they don't reopen the new tab; the ones who do reopen it report "blank page on infinite load".
Root cause
Proximate (frontend): apps/extension/app/containers/Toby.tsx:304 —
if (isInitializing || !isDraftReady || !isUserHydrated) return null;
returns null forever when isUserHydrated (added in commit d68726b29, 2026-04-09: "fix: gate AuthWrapper on user hydration to prevent duplicate onboarding events") never flips true. isUserHydrated is bound to a single unbounded promise in apps/extension/app/state/accessors/user.tsx:45-50:
export const getUser = () =>
  new Promise<LoginResponse | null>((resolve) => {
    chrome.storage.local.get('user', ({ user }) => {
      resolve(user ?? null);
    });
  });
No timeout. No chrome.runtime.lastError check. No .catch(). If the chrome.storage.local.get callback never fires, the promise hangs forever and AuthWrapper returns null forever.
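To make the failure mode concrete, here is a minimal sketch of how a promise that never settles can be bounded. `withTimeout` and `droppedCallback` are hypothetical names used only for illustration, not existing Toby code, and the 50 ms deadline is chosen purely to keep the demo fast.

```typescript
// Hypothetical helper (not existing Toby code): settle with `fallback`
// if `promise` has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number, fallback: T): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => resolve(fallback), ms);
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Stand-in for a chrome.storage.local.get call whose callback is dropped:
// the executor never calls resolve, so the promise stays pending forever.
const droppedCallback = new Promise<string | null>(() => {});

async function demo(): Promise<void> {
  // Unbounded: `await droppedCallback` would never return.
  // Bounded: falls back to null after 50 ms.
  const user = await withTimeout(droppedCallback, 50, null);
  console.log(user); // prints null
}

demo().catch(console.error);
```

Without the wrapper, anything awaiting `droppedCallback` waits forever, which is the state the gate is stuck in.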
Distal (platform): Chrome's "extension context invalidated" state (renderer-side) drops chrome.storage.local.get callbacks. This state is entered when Chrome auto-updates the extension while a chrome_url_override new-tab is open (which is every Toby user, every release), when the user manually disables/re-enables the extension, or when the SW crashes during a critical handshake phase. This is a Chromium MV3 platform behaviour, not a Toby code regression.
Why now (post-2026-04-09): the underlying chrome.storage drop is an evergreen Chrome MV3 phenomenon — the extension used to accidentally tolerate it because the AuthWrapper gate only depended on isInitializing || !isDraftReady. Commit d68726b29 added !isUserHydrated to the gate, binding the rendered UI 1:1 to that unbounded callback. The widened gate is what turned a tolerable platform quirk into a reliable user-visible hang.
What this is NOT
The earlier toby-product-strategist hypothesis (artifact 388c1db4) that this was an MV3 service-worker boot regression is refuted:
- Prod-api SHA hasn't changed since 2026-02-02 (commit-sha=4b0107858, three consecutive Cloud Run revisions on the same SHA; the 2026-04-01 deploys are config-only redeploys).
- 5xx volume on prod-api last 24 h: 0. Worst day this week: 19 / 1.18M = 0.0016%.
- 23 ERROR-severity log entries in last 7 days, all expected 401s on stale-session endpoints. No panics. No fatals.
- DB healthy: 41,578 DAU, 720 new signups / 7 d, healthy diurnal curve.
- SW boot path is structurally clean: every chrome.*.addListener registers synchronously at module top level. No listener-after-await MV3 boot bug. getUser() does NOT hit the network; the hang is pre-HTTP, so an API regression cannot be the cause.
Proposed fix (frontend, defence-in-depth)
Layer 1 — bound the hydration promises with a 5s timeout that fails open
apps/extension/app/state/accessors/user.tsx around line 71 (the useEffect that calls getUser()):
useEffect(() => {
  let cancelled = false;
  // Fail open: if storage never answers, unblock the gate with a null user.
  const timeout = setTimeout(() => {
    if (!cancelled) {
      console.warn('[toby] getUser() exceeded 5s; falling back to null user.');
      setIsUserHydrated(true);
    }
  }, 5000);
  getUser()
    .then((user) => {
      if (cancelled) return;
      if (user) setUser(user);
      setIsUserHydrated(true);
    })
    .catch((err) => {
      console.error('[toby] getUser() failed:', err);
      if (!cancelled) setIsUserHydrated(true);
    })
    .finally(() => clearTimeout(timeout));
  // Cleanup on unmount: suppress late state updates and cancel the timer.
  return () => {
    cancelled = true;
    clearTimeout(timeout);
  };
}, []);
Apply the same shape to apps/extension/app/hooks/useOnboarding2Draft.ts:12-30 for isDraftReady.
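Since two hooks need the same shape, one option is to factor it into a small framework-free helper. The sketch below is an assumption about how that could look; `hydrateWithDeadline` is a hypothetical name, not existing Toby code. It also differs from the Layer 1 snippet in one respect: a value that arrives after the deadline is ignored rather than applied late.

```typescript
// Hypothetical helper (not existing Toby code). Calls `onSettled` at most once:
// with the loaded value, or with null if `load` rejects or exceeds `timeoutMs`.
// Returns a cancel function suitable as a useEffect cleanup.
function hydrateWithDeadline<T>(
  load: () => Promise<T>,
  onSettled: (value: T | null) => void,
  timeoutMs: number = 5000,
): () => void {
  let done = false;

  function settle(value: T | null): void {
    if (done) return;
    done = true;
    clearTimeout(timer);
    onSettled(value);
  }

  const timer = setTimeout(() => {
    console.warn(`[toby] hydration exceeded ${timeoutMs}ms; failing open.`);
    settle(null);
  }, timeoutMs);

  // Promise.resolve().then(load) also converts a synchronous throw in `load`
  // into a rejection, so it takes the same fail-open path.
  Promise.resolve()
    .then(load)
    .then(settle, (err: unknown) => {
      console.error('[toby] hydration failed:', err);
      settle(null);
    });

  return () => {
    done = true;
    clearTimeout(timer);
  };
}
```

In a component this could presumably be wired as `useEffect(() => hydrateWithDeadline(getUser, (user) => { if (user) setUser(user); setIsUserHydrated(true); }), [])`. Whether a user that arrives after the deadline should still be applied is a product decision this sketch does not settle.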
Layer 2 — replace return null with a visible escape hatch after 8s
apps/extension/app/containers/Toby.tsx:304:
const [showStuckEscapeHatch, setShowStuckEscapeHatch] = useState(false);
useEffect(() => {
  // Gate already open: nothing to time.
  if (!isInitializing && isDraftReady && isUserHydrated) return;
  const t = setTimeout(() => setShowStuckEscapeHatch(true), 8000);
  return () => clearTimeout(t);
}, [isInitializing, isDraftReady, isUserHydrated]);
if (isInitializing || !isDraftReady || !isUserHydrated) {
  if (showStuckEscapeHatch) {
    return <StuckRecoveryScreen onRetry={() => window.location.reload()} />;
  }
  return null;
}
Copy: "Your tabs are safe. Tap to recover." — already pre-approved per toby/00-state-of-the-project.md:50 and toby/strategy/playbook.md O1 KR1.
Layer 3 — telemetry beacon
At the setShowStuckEscapeHatch(true) site, fire trackEvent('NewTabHangShown', { isInitializing, isDraftReady, isUserHydrated, browser, version }). This finally gives us a signal between "user complains in CWS review" and our existing Sentry/Amplitude burn.
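A rough sketch of how the beacon payload could be assembled defensively follows. Everything here is illustrative: `buildHangEvent`, `createHangReporter`, and the tracker signature are hypothetical, and the sketch assumes that reading the extension version may itself throw once the context is invalidated.

```typescript
interface GateFlags {
  isInitializing: boolean;
  isDraftReady: boolean;
  isUserHydrated: boolean;
}

interface HangEvent extends GateFlags {
  browser: string;
  version: string;
}

// Hypothetical builder. `readVersion` is injected because reading it from the
// extension runtime may itself throw once the context is invalidated (assumption).
function buildHangEvent(
  flags: GateFlags,
  browser: string,
  readVersion: () => string,
): HangEvent {
  let version = 'unknown';
  try {
    version = readVersion();
  } catch {
    // Keep 'unknown' rather than lose the whole event.
  }
  return { ...flags, browser, version };
}

// Wraps a tracker so the beacon fires at most once per page load and a
// throwing tracker cannot break the recovery screen.
function createHangReporter(
  track: (name: string, props: HangEvent) => void,
): (event: HangEvent) => void {
  let sent = false;
  return (event) => {
    if (sent) return;
    sent = true;
    try {
      track('NewTabHangShown', event);
    } catch (err) {
      console.warn('[toby] NewTabHangShown beacon failed:', err);
    }
  };
}
```

In the extension, `browser` might come from `navigator.userAgent` and `readVersion` from `() => chrome.runtime.getManifest().version`, with `trackEvent` passed as `track` if its signature matches. Firing only once keeps the baseline count in the telemetry check interpretable as "page loads that hung".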
Backend hardening (follow-up, NOT required to close incident)
The Go API itself does not need a change. But the extension service worker has three unrelated fragility issues that, while they don't cause this bug, do make its underlying platform conditions more frequent. File as follow-ups, ship outside this incident:
- Catch the persist-restore rejection in apps/extension/entrypoints/background.ts:14. Currently fire-and-forget; an IDB failure is silently swallowed.
- AbortController on SW fetch calls in apps/extension/app/background/contextMenus.ts:145-191 (10s timeout). Currently a stuck TCP socket can keep the SW alive past its idle window.
- Unified chromeStorageGet<T>(keys, { timeoutMs }) helper that wraps chrome.runtime.lastError checks + chrome.runtime.id validity + a timeout. Replace every raw chrome.storage.local.get(key, cb) callsite with this. The FE Layer 1 fix only patches the one getUser site; this helper would fix the class.
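The unified helper in the last item could look roughly like the sketch below. This is one possible shape under stated assumptions, not the planned implementation: `makeChromeStorageGet`, `StorageAreaLike`, and `RuntimeLike` are hypothetical names, and the storage area and runtime are injected so the sketch compiles and runs outside an extension. Unlike Layer 1, this version rejects on timeout and leaves the fail-open decision to the caller.

```typescript
// Minimal structural types so the sketch compiles without @types/chrome.
interface StorageAreaLike {
  get(keys: string | string[], callback: (items: Record<string, unknown>) => void): void;
}
interface RuntimeLike {
  id?: string;
  lastError?: { message?: string };
}

// Hypothetical factory: returns a promise-based getter that checks context
// validity up front, inspects lastError in the callback, and enforces a deadline.
function makeChromeStorageGet(area: StorageAreaLike, runtime: RuntimeLike) {
  return function chromeStorageGet<T>(
    keys: string | string[],
    { timeoutMs = 5000 }: { timeoutMs?: number } = {},
  ): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      // A missing runtime id is treated as an invalidated extension context.
      if (!runtime.id) {
        reject(new Error('extension context invalidated'));
        return;
      }
      const timer = setTimeout(
        () => reject(new Error(`chrome.storage get timed out after ${timeoutMs}ms`)),
        timeoutMs,
      );
      try {
        area.get(keys, (items) => {
          clearTimeout(timer);
          // lastError is only meaningful while the callback is running.
          const err = runtime.lastError;
          if (err) reject(new Error(err.message ?? 'chrome.runtime.lastError was set'));
          else resolve(items as T);
        });
      } catch (err) {
        clearTimeout(timer);
        reject(err);
      }
    });
  };
}
```

In the extension one would presumably construct it once with `chrome.storage.local` and `chrome.runtime`, then call `chromeStorageGet<{ user?: LoginResponse }>('user', { timeoutMs: 5000 })` from `getUser`.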
Verify plan
- Manual repro (canonical scenario for chrome.storage drop):
  - cd apps/extension && pnpm install && pnpm dev
  - Load unpacked at apps/extension/.output/chrome-mv3 via chrome://extensions.
  - Open the new tab; confirm happy path renders.
  - Toggle the extension off and back on in chrome://extensions (this puts the open tab into the "context invalidated" state, where chrome.runtime.id === undefined).
  - Reload the new tab. Pre-fix: blank skeleton forever. Post-fix: Onboarding2 (or App) renders after 5s, with the [toby] getUser() exceeded 5s console warning.
- Recovery-screen repro (Layer 2):
  - In DevTools, monkey-patch chrome.storage.local.get = () => {} before reloading the new-tab page. Pre-Layer-2: blank. Post-Layer-2: StuckRecoveryScreen renders after 8s with the "tap to recover" CTA.
- Regression check (the d68726b29 bug must stay fixed): When isUserHydrated legitimately resolves with a pre-existing user before the 5s timeout, AuthWrapper must behave exactly as today: no flash of <Onboarding2> for returning users.
- Telemetry sanity (Layer 3): Confirm NewTabHangShown events flow into Amplitude. Establish baseline frequency in the first 7 days. If volume is non-trivial without a correlated prod-api 5xx spike, the platform-side chrome.storage drop hypothesis is confirmed.
- Backend monitoring (no action expected): Continue watching prod-api 5xx; expect to stay at ~0. If a 5xx spike correlates with a new NewTabHangShown spike, re-open backend investigation. With current cadence (one prod-api code change in 4 months) this is unlikely.
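The screen expectations in the verify plan can be written down as a pure decision function, which makes the regression check easy to unit test. This is only a sketch: `selectScreen`, `ScreenKind`, and `hasUser` are hypothetical names, and the mapping "user present renders App, no user renders Onboarding2" is an assumption inferred from the plan rather than taken from the code.

```typescript
type ScreenKind = 'blank' | 'recovery' | 'onboarding' | 'app';

interface GateInputs {
  isInitializing: boolean;
  isDraftReady: boolean;
  isUserHydrated: boolean;
  showStuckEscapeHatch: boolean;
  hasUser: boolean;
}

// Hypothetical mirror of the AuthWrapper gate after Layers 1 and 2.
function selectScreen(s: GateInputs): ScreenKind {
  const gateOpen = !s.isInitializing && s.isDraftReady && s.isUserHydrated;
  if (!gateOpen) return s.showStuckEscapeHatch ? 'recovery' : 'blank';
  // Assumption: a hydrated user goes to App, a null user to Onboarding2.
  return s.hasUser ? 'app' : 'onboarding';
}

const returningUser: GateInputs = {
  isInitializing: false,
  isDraftReady: true,
  isUserHydrated: true,
  showStuckEscapeHatch: false,
  hasUser: true,
};

const scenarios: Array<[string, GateInputs, ScreenKind]> = [
  ['returning user, hydrated in time', returningUser, 'app'],
  ['storage dropped, Layer 1 failed open', { ...returningUser, hasUser: false }, 'onboarding'],
  ['still hydrating, under 8s', { ...returningUser, isUserHydrated: false }, 'blank'],
  [
    'still hydrating, past 8s',
    { ...returningUser, isUserHydrated: false, showStuckEscapeHatch: true },
    'recovery',
  ],
];

for (const [label, input, expected] of scenarios) {
  console.log(label, selectScreen(input) === expected ? 'ok' : 'MISMATCH');
}
```

The first scenario is the regression check: a returning user whose hydration completes in time must land on the app, never on onboarding.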
Open questions / operator decisions
- Defer SW hardening? The three backend-flagged hardening items are not required to close this incident. Ship FE Layer 1+2+3 first; queue SW hardening as a separate piece of work.
- Do we want the NewTabHangShown beacon gated on a feature flag? Default is to ship it on.
Citations
- Frontend finding: artifacts/toby-frontend-doctor/6e2b3eb9-36bf-42d3-8de3-5afa48f4b167/finding.md (run id 6e2b3eb9-36bf-42d3-8de3-5afa48f4b167).
- Backend finding: artifacts/toby-backend-doctor/083ec6d2-63e9-4c3e-b55e-a95301a4aa72/finding.md (run id 083ec6d2-63e9-4c3e-b55e-a95301a4aa72).
- Frontend Playwright repro screenshot: same folder, repro-blank-page.png.
- Proximate code site: apps/extension/app/containers/Toby.tsx:304.
- Proximate hydration site: apps/extension/app/state/accessors/user.tsx:45-50, 66-99.
- Proximate commit: d68726b29 (2026-04-09).
- Prior strategist hypothesis (REFUTED): artifact 388c1db4-59b7-49e9-8ec3-ecfba972c95f.