artifacts/toby-incident-coordinator/889c2366-0fe8-45ee-afb0-d293f41bd015/synthesis-draft.md

Synthesis draft: retention_offers silent (2026-05-12)
Root cause (high-confidence, agreed by both doctors)
The "0 offers issued" framing in the ticket conflates two different things. Three structural realities together explain the silence:
- `retention_offers` is accept-only by design. The migration (`apps/api/data/migrations/V71__retention_offers.up.sql:1-15`) and the only insert site (`apps/api/context/v3/subscription_context.go:697-710`) prove the table records ACCEPTS only. There is no `status`, no `offered_at`, no `declined_at`. Every "0 offers issued" data point is actually "0 accepts"; the metric the ticket asks about is unmeasurable from this table. (Backend finding, evidence section "Schema reality".)
- ~82% of cancels bypass the in-app retention modal entirely. Backend funnel for last 30d: 120 cancels → 22 cancellation_reasons → 1 retention/accept. That's an 82% leak between Stripe-status-flip and our FE modal. Best-supported explanation:
  - Stripe Customer Portal preloaded as a `<Link href={stripeUrl} target="_blank">View</Link>` in the same in-app Subscription panel (`apps/extension/app/components/Modal/OrgSettings/Subscription.tsx:51-62, 192-208`).
  - Every Stripe renewal-notice / receipt email contains a "Manage subscription" link to the same portal.
  - For `team_legacy`/`team_basic` users, the in-app cancel CTA is completely hidden (`Subscription.tsx:41-43`: the `hasSubscription` gate excludes them), so they have no path other than the portal. February 2026 was peak churn driven by ThankYouLegacy renewals (per `product/metrics/surveys/churn-survey-analysis.md`), so exactly the cohort with the most cancel pressure is structurally invisible to retention.

  Cancels through any of those three paths reach Toby only as a Stripe `customer.subscription.updated` webhook. The `subscription_context.go:75-151` `Cancel` handler never invokes retention; the FE orchestrates the entire retention flow.
- Of the 22 users who DID submit a reason in 30d, only 1 accepted. Backend has zero telemetry on whether the other 21 saw the offer and declined, saw it and abandoned, or never had the FE call `GET /retention-offer` at all. The healthy ~4-10% accept-to-reason ratio held even in busier months (Feb-26: 134 reasons / 10 accepts = 7.5%). So this 30d isn't a regression; it's the long-running funnel reality, just with a smaller absolute sample.
Striking pattern: 16 of 17 all-time accepts are retention_legacy (legacy-user discount). Only 1 retention_yearly accept (today, 2026-05-12). Either FE doesn't surface the yearly retention offer to non-legacy yearly users, or they decline at 100%. Both doctors flag this as worth product attention.
Not a regression. No FE changes to the cancel modal since 2026-03-31 (`cbc92a78d` widened eligibility); no `apps/api` deploys since 2026-02-02. The funnel has been structurally leaky for months; Feb 2026's spike just made the absolute numbers look "fine" in aggregate.
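The funnel percentages quoted above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: `conversionPct` is a hypothetical helper, and the counts are the ones reported in the backend finding.

```go
package main

import "fmt"

// conversionPct returns the share of `from` that reached `to`, as a percentage.
// It returns 0 when `from` is 0 so an empty window does not divide by zero.
func conversionPct(from, to int) float64 {
	if from == 0 {
		return 0
	}
	return 100 * float64(to) / float64(from)
}

func main() {
	// Last-30d counts from the backend finding.
	cancels, reasons, accepts := 120, 22, 1

	// Share of cancels that never produced a cancellation_reasons row.
	fmt.Printf("bypass before modal: %.1f%%\n", 100-conversionPct(cancels, reasons)) // 81.7%

	// Accept-to-reason ratio for the 30d window and for Feb 2026.
	fmt.Printf("accept-to-reason (30d): %.1f%%\n", conversionPct(reasons, accepts)) // 4.5%
	fmt.Printf("accept-to-reason (Feb 2026): %.1f%%\n", conversionPct(134, 10))     // 7.5%
}
```

Both accept-to-reason values fall inside the ~4-10% band, which is what supports reading the 30d window as a small sample rather than a regression.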
Proposed fix — tiered
Tier 1 — Backend instrumentation patch (this incident's actual ship candidate)
Add a structured log line in GetRetentionOffer (apps/api/context/v3/subscription_context.go, around line 599-607) so we emit a retention_offer_eligible event whenever result.Eligible == true. This:
- Gives ops/product a real "eligible offers" datapoint that can be counted in Cloud Logging without a schema change.
- Crucially, allows us to compute the real funnel ratio (cancellation_reasons → eligible-offers → accepts) over the next 14-30 days and prove or disprove the Tier 2/3 product hypotheses below.
- 10-LOC patch, no behavior change, no risk to user-facing flow.
- Recommended by backend doctor as the cheapest visibility win.
Concrete diff (target):
```go
// apps/api/context/v3/subscription_context.go, around L599-607
// (inside GetRetentionOffer, after eligibility evaluated, before response built)
if result.Eligible {
	log.Info("retention_offer_eligible",
		"team_id", team.ID,
		"user_id", userID,
		"offer_type", result.OfferType,
		"coupon_id", result.CouponID,
	)
}
```
(Exact log API and result field names should be confirmed against the surrounding code at apply time — both doctors cited that approximate region; fix-shipper to confirm at L599-607.)
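For trying the event shape outside the service, here is a minimal self-contained sketch using Go's standard `log/slog` package. This is an assumption for illustration: the repo's actual logger has not been confirmed, and `offerResult`, `logEligibleOffer`, and `captureEligibleLog` are hypothetical names, not code from `subscription_context.go`.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
)

// offerResult is a hypothetical stand-in for the value GetRetentionOffer
// computes; the real field names must be confirmed at apply time.
type offerResult struct {
	Eligible  bool
	OfferType string
	CouponID  string
}

// logEligibleOffer emits one structured event when the offer is eligible
// and reports whether anything was logged. It never changes the result.
func logEligibleOffer(logger *slog.Logger, teamID, userID string, r offerResult) bool {
	if !r.Eligible {
		return false
	}
	logger.Info("retention_offer_eligible",
		"team_id", teamID,
		"user_id", userID,
		"offer_type", r.OfferType,
		"coupon_id", r.CouponID,
	)
	return true
}

// captureEligibleLog runs logEligibleOffer against an in-memory JSON handler
// and returns whatever was written, so the output shape can be inspected.
func captureEligibleLog(teamID, userID string, r offerResult) string {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	logEligibleOffer(logger, teamID, userID, r)
	return buf.String()
}

func main() {
	// Placeholder IDs; prints one JSON line with the four attributes.
	fmt.Print(captureEligibleLog("team-1", "user-1",
		offerResult{Eligible: true, OfferType: "retention_yearly", CouponID: "coupon-1"}))
}
```

One thing worth checking at apply time: `slog`'s JSON handler writes the message under the key `msg`, whereas the verify query filters on `jsonPayload.message`. Whichever logger the service really uses, the query key should match the key it emits.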
Tier 2 — Product/FE — open questions (NOT shipped in this incident)
Each of these is a product decision; the doc surfaces them so they can be filed as their own tickets:
- Stripe Portal bypass. Should `Subscription.tsx` hide the "View invoices" link until after the user has gone through the retention modal? Or should we configure Stripe `flow_data` so portal-side cancel redirects back to Toby's retention flow? Either eliminates the largest single bypass path.
- Legacy/basic CTA gate. Should `hasSubscription` (`Subscription.tsx:41-43`) stop excluding `team_legacy`/`team_basic`? Their cancel pressure is the worst-case (peak churn) and is currently structurally invisible to retention.
- Yearly retention offer. Why do non-legacy yearly users almost never accept `retention_yearly` (1 all-time accept, on 2026-05-12)? Is the FE rendering it (`subscription_context.go:475-497` branch), or is it silently skipped?
Tier 3 — Schema/analytics work (NOT shipped)
- Add a `retention_offer_views` table OR add `status`/`offered_at`/`declined_at` columns to `retention_offers`.
- Wire Amplitude `RETENTION_OFFER_SHOWN`/`RETENTION_OFFER_DECLINED` events into the BI pipeline so the funnel is visible from the analytics side.
Tier 4 — Housekeeping (NOT shipped — separate ticket worthy)
The 5 `TOBY_RETENTION*` secrets don't exist in GCP Secret Manager. `gcloud secrets list --filter=name~TOBY_RETENTION` returned `[]`. The code falls back to struct-tag defaults (which are correct), but every cold start logs 5 lines of `failed to access secret version`. File a separate ticket to either create the secrets or remove the lookup.
Verify plan (for Tier 1, the ship candidate)
After deploy of the log-line patch:
- Wait 24h (cancel-flow traffic is sparse; 1-2 retention_offer events per day in healthy weeks).
- Cloud Logging query in `toby-production-286416`: `resource.labels.service_name="prod-api" AND jsonPayload.message="retention_offer_eligible" AND timestamp >= "2026-05-13T00:00:00Z"`
- Expectation: at least 1-3 events over 24-48h (consistent with the historical 22 reasons / month → ~1 per day eligible). If zero, that itself is signal: either the FE isn't calling `/retention-offer` after reason submit, or all callers are getting `eligible:false`. Either way, the patch succeeded (it's making invisible behavior visible).
- Compare to the same window's `count(cancellation_reasons.created_at)` to compute the FE-funnel ratio.
After 14 days, this telemetry should be sufficient to file the Tier 2/Tier 3 follow-up tickets with real numbers.
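As a rough sanity check on the expected volume, the sketch below scales the historical 22 reasons per 30 days to a given window and estimates the chance of seeing no events at all. Two assumptions are introduced here and are not established by the findings: arrivals are treated as roughly Poisson, and every reason submit is treated as producing one eligible event (an upper bound). Function names are illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// expectedEvents scales a historical count over periodDays to a window of windowHours.
func expectedEvents(count, periodDays, windowHours float64) float64 {
	return count / periodDays * windowHours / 24
}

// probZeroEvents is the Poisson probability of observing no events in the window.
// Poisson arrivals are an assumption made for this estimate only.
func probZeroEvents(count, periodDays, windowHours float64) float64 {
	return math.Exp(-expectedEvents(count, periodDays, windowHours))
}

func main() {
	// Windows from the verify plan: 24h, 48h, and the 14-day follow-up horizon.
	for _, hours := range []float64{24, 48, 14 * 24} {
		fmt.Printf("%4.0fh: expected %.2f events, P(zero) = %.3f\n",
			hours, expectedEvents(22, 30, hours), probZeroEvents(22, 30, hours))
	}
}
```

Under these assumptions a silent first 24h is close to even odds and would be weak evidence on its own, while silence across the full 14 days would be very unlikely if the FE is calling the endpoint. That suggests reading a zero count more firmly as the window lengthens.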
Open questions for the operator
- Is Tier 1 alone enough to "close" TOBY-6? Argument for: the ticket is "save flow not triggering or not logging" — Tier 1 fixes the logging dimension definitively and creates the instrumentation to investigate the triggering dimension. Argument against: the user-visible problem (low retention save rate) is untouched; Tier 1 is a visibility patch, not a save-rate patch.
- Recommended: ship Tier 1 to close TOBY-6 as "validated diagnosis + visibility patch shipped"; file three sibling tickets for Tier 2-4 and let product own them with the new telemetry once it's flowing.
Evidence citations
| Claim | Source |
|---|---|
| Schema is accept-only | Backend finding L75-88, code at apps/api/data/migrations/V71__retention_offers.up.sql:1-15, apps/api/context/v3/subscription_context.go:697-710 |
| 30d funnel 120 → 22 → 1 | Backend finding L52-61 (DB queries) |
| Stripe Portal "View" preload | Frontend finding L92-105, apps/extension/app/components/Modal/OrgSettings/Subscription.tsx:51-62, 192-208 |
| Legacy users have no in-app CTA | Frontend finding L119-135, apps/extension/app/components/Modal/OrgSettings/Subscription.tsx:41-43 |
| 16/17 accepts are retention_legacy | Backend finding L90-95 |
| No recent regression in FE or BE | Frontend finding L186-190, Backend finding L31 (revision deployed 2026-02-02) |
| Missing TOBY_RETENTION secrets | Backend finding L118-142, gcloud secrets list --filter=name~TOBY_RETENTION → [] |
| Live happy-path sequence proves backend healthy | Backend finding L98-110 (Cloud Logging trace for team ce2cc1ac on 2026-05-12) |
Self-applied triple-check on this draft
- Correctness: Funnel numbers cross-checked against backend doctor's queries (120, 22, 1) and ratio (4-10% accept-to-reason) consistent across 4 months. Schema fact verified from migration + insert site. Bypass paths verified by reading FE code paths cited.
- Quality: Tier 1 is a minimal, observable patch with a real verify plan. Tier 2-4 surfaced as open work items rather than over-promising. Citations attach every load-bearing claim to a finding artifact path.
- Safety: Tier 1 is logging-only: no behavior change, no schema change, no user impact. The fix-shipper can `go build` + `go vet` and ship to a feature branch off `origin/main`. If the log signature surprises, it's a 10-line revert.
Sending to validator next.