Toby code review · 2026-05-13
First review on this watch — back-filled the most recent 10 commits on `main`. Bulk of the volume is Jad's onboarding-experiment instrumentation (Session Start heartbeat + planning docs) and a new sandcastle agentic-workflow subsystem under `.sandcastle/`. One Slack-fallback fix from Jp on the CWS review monitor.
What shipped
d68726bfix: gate AuthWrapper on user hydration to prevent duplicate onboarding events · @Jad Haidar9d6e8e4chore: update extension version to 1.13.0 · @Jad Haidarb1fe667docs: add BigQuery schema, connection IDs, and data caveats to analytics skill (#9) · @Jad Haidar406849docs: add phase 1 and phase 2 specs, plans, and todos for onboarding experiment work · @Jad Haidar0f3aa38feat: add 4h Session Start heartbeat event for intra-day retention analysis · @Jad Haidar389556afix: correct fabricated "Open toby" volume baseline in Phase 1 halt threshold · @Jad Haidare7d00a8chore: add build:store script to build all prod extension variants · @Jad Haidarda1ba81docs: add Installation device_id data caveat and additional analytics troubleshooting · @Jad Haidarb9bea18Post fallback Slack message when CWS review AI draft fails (#10) · @Jp75a09e3feat: add sandcastle agentic workflow for unattended slice execution · @Jad Haidar
Findings
medium · quality — apps/extension/app/hooks/useSessionStart.ts:1-75 (0f3aa38)
await chrome.storage.local.set({ lastSessionStart: now, sessionCount, }); sendLogEvent(TRACK_EVENT_SESSION_START, {...}, userId);
New 75-line hook that drives a production analytics event with no test
coverage. The interesting bits — the 4h rate limit, the
createdDate → days_since_signup parse with Number.isFinite guard,
the storage-write-before-send ordering — are exactly the kind of
quiet logic that breaks silently and is invisible in BQ until someone
runs a backfill query. Author's own spec (tasks/phase1-spec.md AC2/AC4)
calls out that runtime validation in staging is "still required"; a unit
test on the hook itself would have already caught a stale userId
shape or a broken rate-limit math without waiting on staging. Suggested
action: add useSessionStart.test.ts that fakes chrome.storage.local
and asserts (1) skips within 4h, (2) fires after 4h, (3) userId is
undefined when user is null, (4) days_since_signup is null for an
unparseable createdDate. Ticket filed.
medium · security — .sandcastle/scan-secrets.mts:1-182 (75a09e3)
{ name: 'anthropic-api-key', re: /sk-ant-[A-Za-z0-9_-]{8,}/ }, { name: 'stripe-live-key', re: /sk_live_[A-Za-z0-9]{16,}/ }, ... const SKIP_BASENAME_RE = /^(pnpm-lock\.yaml|...|\.min\.(js|css|map)|\.(png|jpg|...|woff2?|wasm))$/i;
The host-side secret scanner is the only thing standing between an
unattended LLM agent and a pushed credential. It is a regex-on-added-lines
walker with a skip-list for lockfiles and binaries — single-pattern,
single-line, no inline-base64 detection, no multi-line obfuscation
detection. The author's own comment ("Not designed against an adversarial
agent that obfuscates the secret across multiple lines") is honest, but
the function has no test coverage to even confirm the patterns and
skip-list work as intended. A regression here is invisible until a real
secret slips through. Suggested action: add
.sandcastle/scan-secrets.test.mts with a fixture per PATTERNS entry
(positive + negative), a fixture confirming the SKIP_BASENAME_RE
covers pnpm-lock.yaml and .png, and a fixture confirming the diff
parser counts the right newLineNo for added lines after a hunk header.
Ticket filed.
medium · security — .sandcastle/main.mts:602,648 (75a09e3)
if (process.env.SANDCASTLE_SKIP_GATES_VERIFY === '1') { ... return true; } if (process.env.SANDCASTLE_SKIP_SECRET_SCAN === '1') { ... return true; }
Two env-var escape hatches fully disable the verification + secret-scan
gates with nothing more than a console.warn on stdout. For a developer
tool this is defensible, but as the workflow scales out (e.g. CI runner,
shared CI image, a teammate's shell rc-file that exports them once "for
debugging"), there's no audit trail that ties a pushed branch back to
which gate was waived. Suggested action: record the skip flags as a
field in the PR body when set, and append a line to a
.sandcastle/audit.log that survives runs. Cheap, makes after-the-fact
forensics possible. Ticket filed.
low · quality — apps/api/context/jobs/cws_review_monitor_jobs.go:172 (b9bea18)
"*AI draft generation failed:*\n```%s```\n\n"+
safeErr := escapeSlackMrkdwn(draftErr.Error()) is good (handles
@here, <links>), but Slack's escape rules don't cover backticks, so a
draftErr.Error() that happens to contain a literal ``` (e.g. an
OpenAI response body echoed back) breaks the triple-backtick fence and
leaks the rest of the message into mrkdwn-parsed mode. Pure formatting
hazard, not a security issue — the safeAuthor/safeComment paths
already mitigate @channel injection. Suggested action: either swap to
Slack Block Kit (type: "rich_text_preformatted") for the error block,
or strip backticks from safeErr before interpolation. Doc-only;
doesn't earn a ticket.
low · quality — apps/extension/app/containers/Toby.tsx:300 (d68726b)
// Wait for both user and draft to hydrate before deciding which UI to show.
if (isInitializing || !isDraftReady || !isUserHydrated) return null;
Correct fix — adds the missing hydration gate so the briefly-null
user doesn't flash Onboarding2 for returning users. The comment is
crisp. No test added for the hydration window, but this is a single-line
guard on a chain that already had two predicates; the dimension is small
enough that a regression test would arguably be unit-testing React's
render scheduler. Doc-only.
low · correctness — .sandcastle/main.mts:158-159 (75a09e3)
const EXPECTED_GATES = ['lint', 'build'] as const; // `typecheck` is INTENTIONALLY OMITTED ... // `test` is INTENTIONALLY OMITTED ...
Worth flagging that the sandcastle workflow ships with typecheck and
test gates disabled by design because the extension repo has
pre-existing TS debt and no vitest scaffold on main. Any slice the
agent produces will skip those — so a slice that compiles dirty or
breaks an as-yet-uncommitted test will pass gates. The author's comment
acknowledges this and points to the Step 8 review subagent as the
backstop. Reasonable trade for now, but the moment the TS-debt branch
lands, EXPECTED_GATES must be updated in lockstep — easy to forget.
Doc-only for now; would warrant a tracking issue once the TS-clean
branch is in flight.
low · quality — .sandcastle/main.mts:863-867 (75a09e3)
execFileSync(
'gh',
['pr', 'create', '--base', 'main', '--head', branch, '--title', title, '--body', body],
{ stdio: 'inherit' },
);
title is derived from latestSubject (the agent's commit message).
The orchestrator uses execFileSync with an argv array — no shell, no
injection. Worth a mention because the trust boundary is interesting:
the agent fully controls its own commit subject and that's the PR
title. Anything inflammatory in the subject becomes the PR title.
Defensible since commits are reviewed in the PR anyway. Doc-only.
Filed tickets
useSessionStarthook needs unit-test coverage (improvement, medium).sandcastle/scan-secrets.mtsneeds unit-test coverage (improvement, medium)- Sandcastle skip-flags need audit logging (issue, low)