
Synthesis draft — retention_offers silent (2026-05-12)

Hand-authored · Last edited May 13 by initial import

Root cause (high-confidence, agreed by both doctors)

The "0 offers issued" framing in the ticket conflates offers issued with offers accepted. Three structural realities together explain the silence:

  1. retention_offers is accept-only by design. The migration (apps/api/data/migrations/V71__retention_offers.up.sql:1-15) and the only insert site (apps/api/context/v3/subscription_context.go:697-710) prove the table records ACCEPTS only. There is no status, no offered_at, no declined_at. Every "0 offers issued" data point is actually "0 accepts"; the metric the ticket asks about is unmeasurable from this table. (Backend finding, evidence section "Schema reality".)

  2. ~82% of cancels bypass the in-app retention modal entirely. Backend funnel for last 30d: 120 cancels → 22 cancellation_reasons → 1 retention/accept. That's an 82% leak between Stripe-status-flip and our FE modal. Best-supported explanation:

    • Stripe Customer Portal preloaded as a <Link href={stripeUrl} target="_blank">View</Link> in the same in-app Subscription panel — apps/extension/app/components/Modal/OrgSettings/Subscription.tsx:51-62, 192-208.
    • Every Stripe renewal-notice / receipt email contains a "Manage subscription" link to the same portal.
    • For team_legacy / team_basic users, the in-app cancel CTA is completely hidden (Subscription.tsx:41-43 hasSubscription gate excludes them) — they have no path other than the portal. February 2026 was peak churn driven by ThankYouLegacy renewals (per product/metrics/surveys/churn-survey-analysis.md), so exactly the cohort with the most cancel pressure is structurally invisible to retention.

    Cancels through any of those three paths reach Toby only as a Stripe customer.subscription.updated webhook. The Cancel handler (subscription_context.go:75-151) never invokes retention; the FE orchestrates the entire retention flow.

  3. Of the 22 users who DID submit a reason in 30d, only 1 accepted (4.5%). Backend has zero telemetry on whether the other 21 saw the offer and declined, saw it and abandoned, or never had the FE call GET /retention-offer at all. The healthy ~4-10% accept-to-reason ratio held even in busier months (Feb 2026: 10 accepts / 134 reasons = 7.5%). So this 30d window isn't a regression; it's the long-running funnel reality, just with a smaller absolute sample.

Striking pattern: 16 of 17 all-time accepts are retention_legacy (legacy-user discount). Only 1 retention_yearly accept exists (today, 2026-05-12). Either FE doesn't surface the yearly retention offer to non-legacy yearly users, or nearly all of them decline. Both doctors flag this as worth product attention.

Not a regression. No FE changes to the cancel modal since 2026-03-31 (cbc92a78d widened eligibility); no apps/api deploys since 2026-02-02. The funnel has been a leaky structural shape for months — Feb 2026's spike just made the absolute numbers look "fine" in aggregate.

Proposed fix — tiered

Tier 1 — Backend instrumentation patch (this incident's actual ship candidate)

Add a structured log line in GetRetentionOffer (apps/api/context/v3/subscription_context.go, around lines 599-607) so we emit a retention_offer_eligible event whenever result.Eligible == true. This:

  • Gives ops/product a real "offer eligible" datapoint that can be counted in Cloud Logging without a schema change. It counts backend eligibility responses, not confirmed FE renders of the offer.
  • Crucially, allows us to compute the real funnel ratio (cancellation_reasons → eligible-offers → accepts) over the next 14-30 days and prove or disprove the Tier 2/3 product hypotheses below.
  • 10-LOC patch, no behavior change, no risk to user-facing flow.
  • Recommended by backend doctor as the cheapest visibility win.

Concrete diff (target):

// apps/api/context/v3/subscription_context.go, around L599-607 (inside GetRetentionOffer, after eligibility evaluated, before response built)
if result.Eligible {
    log.Info("retention_offer_eligible",
        "team_id", team.ID,
        "user_id", userID,
        "offer_type", result.OfferType,
        "coupon_id", result.CouponID,
    )
}

(Exact log API and result field names should be confirmed against the surrounding code at apply time — both doctors cited that approximate region; fix-shipper to confirm at L599-607.)
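Because the real logger and result type are unconfirmed, the following is only a self-contained sketch of the same idea using Go's standard log/slog package (Go 1.21+). The offerResult type, the logIfEligible and capture helpers, and the placeholder IDs are assumptions made for illustration; the actual patch should use whatever logger subscription_context.go already imports.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log/slog"
	"os"
)

// offerResult is a hypothetical stand-in for the eligibility result
// computed inside GetRetentionOffer; real field names may differ.
type offerResult struct {
	Eligible  bool
	OfferType string
	CouponID  string
}

// logIfEligible emits one structured event when the offer is eligible
// and reports whether it logged. It never changes the result.
func logIfEligible(logger *slog.Logger, teamID, userID string, r offerResult) bool {
	if !r.Eligible {
		return false
	}
	logger.Info("retention_offer_eligible",
		"team_id", teamID,
		"user_id", userID,
		"offer_type", r.OfferType,
		"coupon_id", r.CouponID,
	)
	return true
}

// capture runs logIfEligible against an in-memory JSON logger and
// returns the decoded record, or nil if nothing was logged.
func capture(teamID, userID string, r offerResult) (map[string]any, error) {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	if !logIfEligible(logger, teamID, userID, r) {
		return nil, nil
	}
	var rec map[string]any
	err := json.Unmarshal(buf.Bytes(), &rec)
	return rec, err
}

func main() {
	// Placeholder IDs; prints one JSON log line to stdout.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	logIfEligible(logger, "team-123", "user-456",
		offerResult{Eligible: true, OfferType: "retention_yearly", CouponID: "coupon-abc"})
}
```

One detail worth checking at apply time: slog's JSON handler writes the message under the key msg, whereas the verify-plan query filters on jsonPayload.message. Whichever logger the service really uses, the query field has to match the key that logger emits.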

Tier 2 — Product/FE — open questions (NOT shipped in this incident)

Each of these is a product decision; the doc surfaces them so they can be filed as their own tickets:

  • Stripe Portal bypass. Should Subscription.tsx hide the "View invoices" link until after the user has gone through the retention modal? Or should we configure Stripe flow_data so portal-side cancel redirects back to Toby's retention flow? Either eliminates the largest single bypass path.
  • Legacy/basic CTA gate. Should hasSubscription (Subscription.tsx:41-43) stop excluding team_legacy / team_basic? Their cancel pressure is the worst-case (peak churn) and is currently structurally invisible to retention.
  • Yearly retention offer. Why has retention_yearly been accepted only once, all-time (on 2026-05-12)? Is the FE rendering the offer produced by the subscription_context.go:475-497 branch, or is it silently skipped?

Tier 3 — Schema/analytics work (NOT shipped)

  • Add a retention_offer_views table OR add status / offered_at / declined_at columns to retention_offers.
  • Wire Amplitude RETENTION_OFFER_SHOWN / RETENTION_OFFER_DECLINED events into the BI pipeline so the funnel is visible from the analytics side.

Tier 4 — Housekeeping (NOT shipped — separate ticket worthy)

The 5 TOBY_RETENTION* secrets don't exist in GCP Secret Manager. gcloud secrets list --filter=name~TOBY_RETENTION returned []. The code falls back to struct-tag defaults (which are correct), but every cold start logs 5 "failed to access secret version" lines. File a separate ticket to either create the secrets or remove the lookup.

Verify plan (for Tier 1, the ship candidate)

After deploy of the log-line patch:

  1. Wait 24h (cancel-flow traffic is sparse; 1-2 retention_offer events per day in healthy weeks).
  2. Cloud Logging query in toby-production-286416:
    resource.labels.service_name="prod-api"
    AND jsonPayload.message="retention_offer_eligible"
    AND timestamp >= "2026-05-13T00:00:00Z"
    
  3. Expectation: roughly 1-3 events over 24-48h (consistent with the historical 22 reasons / month, about 0.7 per day). If zero, that itself is signal: either the FE isn't calling /retention-offer after reason submit, or all callers are getting eligible:false. Either way, the patch succeeded (it's making invisible behavior visible).
  4. Compare to the same window's count(cancellation_reasons.created_at) to compute the FE-funnel ratio.
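The comparison in steps 3 and 4 reduces to a small decision on two counts. A hedged sketch follows; the verdict function and the sample counts are hypothetical, and the zero-reasons guard is an added assumption to avoid dividing by zero:

```go
package main

import "fmt"

// verdict summarizes one verification window from two counts:
// retention_offer_eligible log events and cancellation_reasons rows.
// Hypothetical helper for illustration only.
func verdict(eligibleEvents, reasons int) string {
	switch {
	case reasons == 0:
		// Assumption: with no reasons in the window there is no ratio to compute.
		return "no cancellation_reasons rows in the window; extend the window"
	case eligibleEvents == 0:
		return "zero eligible events: FE may not be calling /retention-offer, or every caller got eligible:false"
	default:
		ratio := 100 * float64(eligibleEvents) / float64(reasons)
		return fmt.Sprintf("eligible/reason = %.0f%% (%d of %d)", ratio, eligibleEvents, reasons)
	}
}

func main() {
	// Made-up counts, not measurements.
	fmt.Println(verdict(0, 2))
	fmt.Println(verdict(2, 4))
}
```

With sample sizes this small, the ratio from a single 24-48h window is only indicative; the 14-day accumulation is what the follow-up tickets should cite.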

After 14 days, this telemetry should be sufficient to file the Tier 2/Tier 3 follow-up tickets with real numbers.

Open questions for the operator

  • Is Tier 1 alone enough to "close" TOBY-6? Argument for: the ticket is "save flow not triggering or not logging" — Tier 1 fixes the logging dimension definitively and creates the instrumentation to investigate the triggering dimension. Argument against: the user-visible problem (low retention save rate) is untouched; Tier 1 is a visibility patch, not a save-rate patch.
  • Recommended: ship Tier 1 to close TOBY-6 as "validated diagnosis + visibility patch shipped"; file three sibling tickets for Tier 2-4 and let product own them with the new telemetry once it's flowing.

Evidence citations

| Claim | Source |
| --- | --- |
| Schema is accept-only | Backend finding L75-88, code at apps/api/data/migrations/V71__retention_offers.up.sql:1-15, apps/api/context/v3/subscription_context.go:697-710 |
| 30d funnel 120 → 22 → 1 | Backend finding L52-61 (DB queries) |
| Stripe Portal "View" preload | Frontend finding L92-105, apps/extension/app/components/Modal/OrgSettings/Subscription.tsx:51-62, 192-208 |
| Legacy users have no in-app CTA | Frontend finding L119-135, apps/extension/app/components/Modal/OrgSettings/Subscription.tsx:41-43 |
| 16/17 accepts are retention_legacy | Backend finding L90-95 |
| No recent regression in FE or BE | Frontend finding L186-190, Backend finding L31 (revision deployed 2026-02-02) |
| Missing TOBY_RETENTION secrets | Backend finding L118-142, gcloud secrets list --filter=name~TOBY_RETENTION returned [] |
| Live happy-path sequence proves backend healthy | Backend finding L98-110 (Cloud Logging trace for team ce2cc1ac on 2026-05-12) |

Self-applied triple-check on this draft

  • Correctness: Funnel numbers cross-checked against backend doctor's queries (120, 22, 1) and ratio (4-10% accept-to-reason) consistent across 4 months. Schema fact verified from migration + insert site. Bypass paths verified by reading FE code paths cited.
  • Quality: Tier 1 is a minimal, observable patch with a real verify plan. Tier 2-4 surfaced as open work items rather than over-promising. Citations attach every load-bearing claim to a finding artifact path.
  • Safety: Tier 1 is logging-only: no behavior change, no schema change, no user impact. The fix-shipper can run go build and go vet, then ship to a feature branch off origin/main. If the log signature surprises, it's a 10-line revert.

Sending to validator next.