Toby code review · 2026-05-13

What shipped

d68726b fix: gate AuthWrapper on user hydration to prevent duplicate onboarding events · @Jad Haidar
9d6e8e4 chore: update extension version to 1.13.0 · @Jad Haidar
b1fe667 docs: add BigQuery schema, connection IDs, and data caveats to analytics skill (#9) · @Jad Haidar
406849 docs: add phase 1 and phase 2 specs, plans, and todos for onboarding experiment work · @Jad Haidar
0f3aa38 feat: add 4h Session Start heartbeat event for intra-day retention analysis · @Jad Haidar
389556a fix: correct fabricated "Open toby" volume baseline in Phase 1 halt threshold · @Jad Haidar
e7d00a8 chore: add build:store script to build all prod extension variants · @Jad Haidar
da1ba81 docs: add Installation device_id data caveat and additional analytics troubleshooting · @Jad Haidar
b9bea18 Post fallback Slack message when CWS review AI draft fails (#10) · @Jp
75a09e3 feat: add sandcastle agentic workflow for unattended slice execution · @Jad Haidar

Findings

medium · quality — `apps/extension/app/hooks/useSessionStart.ts:1-75` (0f3aa38)

await chrome.storage.local.set({
  lastSessionStart: now,
  sessionCount,
});

sendLogEvent(TRACK_EVENT_SESSION_START, {...}, userId);

New 75-line hook that drives a production analytics event with no test coverage. The interesting bits — the 4h rate limit, the createdDate → days_since_signup parse with Number.isFinite guard, the storage-write-before-send ordering — are exactly the kind of quiet logic that breaks silently and is invisible in BQ until someone runs a backfill query. Author's own spec (tasks/phase1-spec.md AC2/AC4) calls out that runtime validation in staging is "still required"; a unit test on the hook itself would have already caught a stale userId shape or a broken rate-limit math without waiting on staging. Suggested action: add useSessionStart.test.ts that fakes chrome.storage.local and asserts (1) skips within 4h, (2) fires after 4h, (3) userId is undefined when user is null, (4) days_since_signup is null for an unparseable createdDate. Ticket filed.

medium · security — `.sandcastle/scan-secrets.mts:1-182` (75a09e3)

{ name: 'anthropic-api-key', re: /sk-ant-[A-Za-z0-9_-]{8,}/ },
{ name: 'stripe-live-key',   re: /sk_live_[A-Za-z0-9]{16,}/ },
...
const SKIP_BASENAME_RE = /^(pnpm-lock\.yaml|...|\.min\.(js|css|map)|\.(png|jpg|...|woff2?|wasm))$/i;

The host-side secret scanner is the only thing standing between an unattended LLM agent and a pushed credential. It is a regex-on-added-lines walker with a skip-list for lockfiles and binaries — single-pattern, single-line, no inline-base64 detection, no multi-line obfuscation detection. The author's own comment ("Not designed against an adversarial agent that obfuscates the secret across multiple lines") is honest, but the function has no test coverage to even confirm the patterns and skip-list work as intended. A regression here is invisible until a real secret slips through. Suggested action: add .sandcastle/scan-secrets.test.mts with a fixture per PATTERNS entry (positive + negative), a fixture confirming the SKIP_BASENAME_RE covers pnpm-lock.yaml and .png, and a fixture confirming the diff parser counts the right newLineNo for added lines after a hunk header. Ticket filed.

medium · security — `.sandcastle/main.mts:602,648` (75a09e3)

if (process.env.SANDCASTLE_SKIP_GATES_VERIFY === '1') { ... return true; }
if (process.env.SANDCASTLE_SKIP_SECRET_SCAN === '1') { ... return true; }

Two env-var escape hatches fully disable the verification + secret-scan gates with nothing more than a console.warn on stdout. For a developer tool this is defensible, but as the workflow scales out (e.g. CI runner, shared CI image, a teammate's shell rc-file that exports them once "for debugging"), there's no audit trail that ties a pushed branch back to which gate was waived. Suggested action: record the skip flags as a field in the PR body when set, and append a line to a .sandcastle/audit.log that survives runs. Cheap, makes after-the-fact forensics possible. Ticket filed.

low · quality — `apps/api/context/jobs/cws_review_monitor_jobs.go:172` (b9bea18)

"*AI draft generation failed:*\n```%s```\n\n"+

safeErr := escapeSlackMrkdwn(draftErr.Error()) is good (handles @here, <links>), but Slack's escape rules don't cover backticks, so a draftErr.Error() that happens to contain a literal ``` (e.g. an OpenAI response body echoed back) breaks the triple-backtick fence and leaks the rest of the message into mrkdwn-parsed mode. Pure formatting hazard, not a security issue — the safeAuthor/safeComment paths already mitigate @channel injection. Suggested action: either swap to Slack Block Kit (type: "rich_text_preformatted") for the error block, or strip backticks from safeErr before interpolation. Doc-only; doesn't earn a ticket.

low · quality — `apps/extension/app/containers/Toby.tsx:300` (d68726b)

// Wait for both user and draft to hydrate before deciding which UI to show.
if (isInitializing || !isDraftReady || !isUserHydrated) return null;

Correct fix — adds the missing hydration gate so the briefly-null user doesn't flash Onboarding2 for returning users. The comment is crisp. No test added for the hydration window, but this is a single-line guard on a chain that already had two predicates; the dimension is small enough that a regression test would arguably be unit-testing React's render scheduler. Doc-only.

low · correctness — `.sandcastle/main.mts:158-159` (75a09e3)

const EXPECTED_GATES = ['lint', 'build'] as const;
// `typecheck` is INTENTIONALLY OMITTED ...
// `test` is INTENTIONALLY OMITTED ...

Worth flagging that the sandcastle workflow ships with typecheck and test gates disabled by design because the extension repo has pre-existing TS debt and no vitest scaffold on main. Any slice the agent produces will skip those — so a slice that compiles dirty or breaks an as-yet-uncommitted test will pass gates. The author's comment acknowledges this and points to the Step 8 review subagent as the backstop. Reasonable trade for now, but the moment the TS-debt branch lands, EXPECTED_GATES must be updated in lockstep — easy to forget. Doc-only for now; would warrant a tracking issue once the TS-clean branch is in flight.

low · quality — `.sandcastle/main.mts:863-867` (75a09e3)

execFileSync(
  'gh',
  ['pr', 'create', '--base', 'main', '--head', branch, '--title', title, '--body', body],
  { stdio: 'inherit' },
);

title is derived from latestSubject (the agent's commit message). The orchestrator uses execFileSync with an argv array — no shell, no injection. Worth a mention because the trust boundary is interesting: the agent fully controls its own commit subject and that's the PR title. Anything inflammatory in the subject becomes the PR title. Defensible since commits are reviewed in the PR anyway. Doc-only.

Filed tickets

useSessionStart hook needs unit-test coverage (improvement, medium)
.sandcastle/scan-secrets.mts needs unit-test coverage (improvement, medium)
Sandcastle skip-flags need audit logging (issue, low)

Toby code review · 2026-05-13

What shipped#

Findings#

medium · quality — apps/extension/app/hooks/useSessionStart.ts:1-75 (0f3aa38)#

medium · security — .sandcastle/scan-secrets.mts:1-182 (75a09e3)#

medium · security — .sandcastle/main.mts:602,648 (75a09e3)#

low · quality — apps/api/context/jobs/cws_review_monitor_jobs.go:172 (b9bea18)#

low · quality — apps/extension/app/containers/Toby.tsx:300 (d68726b)#

low · correctness — .sandcastle/main.mts:158-159 (75a09e3)#

low · quality — .sandcastle/main.mts:863-867 (75a09e3)#

Filed tickets#