
retention_offers silent — save flow not triggering or not logging

Hand-authored · 10 min read · 18 sections · Last edited May 13 by agent (MCP)

TL;DR

retention_offers is accept-only by design — the table records the click on "CLAIM DISCOUNT" and nothing else. The ticket's framing ("0 offers issued, 0 accepted") conflates two different things; from this table you can only ever measure accepts. The 17 all-time rows are 17 real, healthy acceptances (16 retention_legacy, 1 retention_yearly).

The real funnel is severely top-leaky: 120 cancels → 22 cancellation_reasons (18%) → 1 retention/accept (0.83%) in the last 30 days. ~82% of cancels never reach the in-app retention modal at all. Three structural bypass paths explain the leak:

  1. Stripe Customer Portal preloaded as a View link inside the in-app Subscription panel.
  2. Stripe renewal-notice / receipt emails contain "Manage subscription" links to the same portal.
  3. team_legacy / team_basic users have no in-app cancel CTA at all: hasSubscription excludes them.

Not a regression. Prod-api revision prod-api-00427-9p2 has been stable since 2026-02-02; FE cancel-modal files haven't changed since 2026-03-31 (cbc92a78d widened eligibility). The funnel has been structurally leaky for months.

Tier 1 fix is shipping-ready (validator-vetted, awaiting Go-reviewer sign-off): an 8-line zap structured log statement in GetRetentionOffer so we can measure "eligible offers shown" in Cloud Logging without a schema change. Tiers 2-4 (Stripe Portal flow_data, legacy CTA, schema columns, missing secrets) are explicit follow-up tickets, not in this incident's scope.

Symptom

  • Last 30 days: effectively 0 new retention_offers rows (exactly 1, written 23h before this incident).
  • Last 30 days: 120 subscriptions transitioned to canceled.
  • Last 30 days: 22 cancellation_reasons rows.
  • All-time: 17 rows in retention_offers. 16 are retention_legacy. Exactly 1 is retention_yearly and it's from today.
  • Source ticket: TOBY-6 ("retention_offers table silent: save flow not triggering or not logging").

Root cause

1. Schema is accept-only

apps/api/data/migrations/V71__retention_offers.up.sql:1-11 — table columns are id, team_id, user_id, subscription_id, coupon_id, interval, accepted_at, created_at. No status, no offered_at, no declined_at. The only insert site is apps/api/context/v3/subscription_context.go:697-710, which runs only inside AcceptRetentionOffer. So every row in the table is a confirmed user-click on the "CLAIM DISCOUNT" button. Shows-without-accept and declines write nothing.

The ticket's metric definition ("0 offers issued") is unmeasurable from this table — full stop. Until we add an offered/declined surface, we can only ever measure accepts. (Validator independently re-queried information_schema.columns and the migration file. Confirmed.)

2. ~82% of cancels bypass the in-app retention modal

Live HTTP logs (prod-api, last 12 days): 2 requests to /retention-offer total. The full happy-path sequence has only fired for ONE team in the entire 30-day window (team ce2cc1ac-…-90dfb, 2026-05-12 23:17 UTC — reason → /retention-offer eligible:true → /retention/accept → row written → cooldown blocks the retry 17s later). Backend behaved exactly as designed.

The remaining ~98 cancels in 30d reach Toby only as Stripe customer.subscription.updated webhooks, with no cancellation_reasons row written. Three bypass paths (all FE-side):

| # | Path | Code site |
|---|------|-----------|
| a | <Link href={stripeUrl} target="_blank">View</Link> in the in-app Subscription panel | apps/extension/app/components/Modal/OrgSettings/Subscription.tsx:51-62, 192-208 (stripe-portal URL preloaded on every panel mount) |
| b | "Manage subscription" links in Stripe renewal / receipt emails | off-product; same portal as path (a) |
| c | Legacy users have NO in-app cancel CTA at all | apps/extension/app/components/Modal/OrgSettings/Subscription.tsx:41-43: hasSubscription = !!team?.paymentCustomerID && !['team_legacy', 'team_basic'].includes(team.accessRole). The cancel link renders only inside {hasSubscription && (...)} |

The compounding effect on path (c): February 2026 was peak churn driven by ThankYouLegacy renewals (per product/metrics/surveys/churn-survey-analysis.md). The cohort with the worst churn pressure is structurally invisible to retention.

The cancel handler at apps/api/context/v3/subscription_context.go:75-151 goes straight to PaymentSvc.CancelSubscription (line 138, returns Stripe portal URL). No retention check. No offer write. The retention flow is entirely FE-orchestrated; backend has no chance to intervene.

3. Of users who DO reach the modal, very few accept

Of the 22 cancellation_reasons rows in 30d, only 1 led to /retention/accept. The remaining 21 either (a) had the FE skip the GET /retention-offer call, (b) saw the offer and clicked "Cancel anyway", or (c) abandoned. Backend can't disambiguate because of cause #1 (schema gap). FE's RETENTION_OFFER_DECLINED Amplitude event exists (CancelSubscription.tsx:622-627) but isn't visible from prod-api telemetry.

Historic accept-to-reason ratio: 4-10% across Jan-May 2026 (for example, 10 accepts / 134 reasons in Feb-26 = 7.5%). The 30d figure (1 accept / 22 reasons ≈ 4.5%) falls inside that range, so the 30d window isn't a regression, just a smaller absolute sample of the long-running funnel reality.
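The percentages quoted in this report can be re-derived from the raw counts. The sketch below only restates that arithmetic; the helper name pct is illustrative and nothing here queries the database.

```go
package main

import "fmt"

// pct returns part/whole as a percentage. It returns 0 when whole is 0 so an
// empty window does not divide by zero.
func pct(part, whole int) float64 {
	if whole == 0 {
		return 0
	}
	return 100 * float64(part) / float64(whole)
}

func main() {
	// Counts quoted in this report for the last 30 days.
	const cancels, reasons, accepts = 120, 22, 1

	fmt.Printf("reached the reason step: %.1f%% of cancels\n", pct(reasons, cancels))
	fmt.Printf("bypassed the modal:      %.1f%% of cancels\n", pct(cancels-reasons, cancels))
	fmt.Printf("accepted, of cancels:    %.2f%%\n", pct(accepts, cancels))
	fmt.Printf("accepted, of reasons:    %.1f%%\n", pct(accepts, reasons))

	// Feb-26 comparison point from the same report.
	fmt.Printf("Feb-26 accept-to-reason: %.1f%%\n", pct(10, 134))
}
```

The accept-to-reason figure is the one to compare month over month, because it is independent of how many cancels bypass the modal entirely.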

Striking pattern

16 of 17 all-time accepts are retention_legacy (legacy-user discount-vs-renewal-hike). Only one retention_yearly accept ever (today's row). Either the FE doesn't render the yearly-renewal-offer branch (subscription_context.go:475-497), or non-legacy yearly users decline at ~100%. Worth a separate look.

What this is NOT

  • Not a backend bug. GetRetentionOffer eligibility logic is permissive and deterministic. IsEligibleForRetention() returns true for all four cancellation enum values. Cooldown (12mo) and subscription-age (30d) gates work correctly when they fire. No silent feature flag, A/B switch, or kill-switch. No backend deploys since 2026-02-02.
  • Not a recent FE regression. Cancel-modal files unchanged since 2026-03-31 (cbc92a78d widened eligibility by removing the RETENTION_ELIGIBLE_REASONS filter). The funnel has been structurally leaky for months — Feb-26's higher absolute traffic just disguised it.
  • Not a missing-secret-causing-skip. Five TOBY_RETENTION* secrets don't exist in GCP Secret Manager, but gcp_processor.go:25-32 treats missing as ("", false) and envconfig falls back to struct-tag defaults at config.go:185-192 (which match the values actually in retention_offers.coupon_id). System operates correctly on defaults — only side effect is cold-start log spam. (Filed as Tier 4 / separate ticket; out of this incident's scope.)

Fix

Tier 1 — Backend instrumentation patch (validator-vetted, awaiting human review)

Insert a single structured log line in GetRetentionOffer at apps/api/context/v3/subscription_context.go between the eligibility evaluation and the response build (between L597 and L599):

if result.Eligible && result.Offer != nil {
    ctx.Logger.Info("retention_offer_eligible",
        zap.String("teamID", teamID),
        zap.String("userID", userID),
        zap.String("offerType", string(result.Offer.OfferType)),
        zap.String("couponID", result.Offer.CouponID),
    )
}

This replaces the draft diff in the original synthesis. The validator caught four defects in the original (log.Info instead of ctx.Logger.Info, team.ID not in scope, OfferType/CouponID nested under result.Offer.*, missing nil-guard) and produced the compile-ready replacement above. Conventions follow the existing zap usage in the same file (L652-656, L709).

Why it's the right scope for an automated fix: logging-only, additive, no behaviour change, no schema change, no user impact, ~8 LOC, 1-commit rollback. Reverting is a no-op.

Why it's not auto-shipping: validator returned confidence: medium and explicitly recommended human reviewer sign-off before merge. Per Wave 4 spec, medium confidence skips the fix-shipper. The corrected diff above is ready to paste into a PR; a Toby Go reviewer (any owner of apps/api/context/v3/) just needs to approve.

Tier 2 — Product / FE follow-ups (filed as separate tickets, NOT shipped here)

| # | Question | Code site |
|---|----------|-----------|
| 2a | Should Subscription.tsx hide the "View invoices" Stripe-portal link until after the retention modal, OR configure Stripe flow_data to redirect "Cancel plan" back to Toby's modal? | apps/extension/app/components/Modal/OrgSettings/Subscription.tsx:51-62, 192-208 |
| 2b | Should hasSubscription stop excluding team_legacy / team_basic so the cohort with worst churn pressure gets an in-app cancel CTA and a retention opportunity? | apps/extension/app/components/Modal/OrgSettings/Subscription.tsx:41-43 |
| 2c | Why are zero non-legacy yearly users accepting retention_yearly? Is the FE rendering the branch at all? | apps/api/context/v3/subscription_context.go:475-497 and apps/extension/app/components/Modal/Downgrade/RetentionOffer.tsx |

Tier 3 — Schema / analytics (NOT shipped here)

  • Add a retention_offer_views table OR add status / offered_at / declined_at columns to retention_offers. Makes "issued" measurable from the DB independent of logs.
  • Wire Amplitude RETENTION_OFFER_SHOWN / RETENTION_OFFER_DECLINED events into the BI pipeline so funnel is visible without backend changes.

Tier 4 — Housekeeping (separate ticket worthy, NOT shipped here)

Five missing TOBY_RETENTION* secrets in GCP Secret Manager (TOBY_RETENTIONMINSUBSCRIPTIONDAYS, TOBY_RETENTIONCOOLDOWNMONTHS, TOBY_RETENTIONLEGACYYEARLYPRICE, TOBY_RETENTIONCOUPONLEGACY, TOBY_RETENTIONCOUPONYEARLY). Either create them with the current defaults, or remove the lookup entirely. Today they pollute every cold start with 5 "failed to access secret version" log lines.
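The reason the missing secrets are harmless today is the lookup-then-default pattern. The sketch below illustrates that pattern only; it is not the code in gcp_processor.go or config.go. The lookup type and intOrDefault helper are hypothetical, and the defaults of 30 and 12 are assumed from the gate values quoted earlier in this report.

```go
package main

import (
	"fmt"
	"strconv"
)

// lookup is a hypothetical secret source. It returns ("", false) when the
// secret does not exist, which is how this report describes the real lookup.
type lookup func(name string) (string, bool)

// intOrDefault returns the secret parsed as an int, or def when the secret is
// missing. As a simplification it also returns def for malformed values; a
// real config loader may surface that as an error instead.
func intOrDefault(get lookup, name string, def int) int {
	raw, ok := get(name)
	if !ok {
		return def
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return def
	}
	return n
}

func main() {
	// Simulate Secret Manager with none of the TOBY_RETENTION* secrets present.
	none := func(string) (string, bool) { return "", false }

	minDays := intOrDefault(none, "TOBY_RETENTIONMINSUBSCRIPTIONDAYS", 30)
	cooldown := intOrDefault(none, "TOBY_RETENTIONCOOLDOWNMONTHS", 12)
	fmt.Println(minDays, cooldown)
}
```

Either housekeeping option keeps this behaviour: creating the secrets with the current defaults changes nothing functionally, and removing the lookup leaves only the defaults.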

Verify plan (Tier 1)

  1. Apply the corrected diff (above) at L598 of apps/api/context/v3/subscription_context.go. Confirm go build, go vet, and the package's existing tests still pass.
  2. Deploy to prod-api as a normal release (no special migrations, no flag).
  3. Wait 24-48 hours. Cancel-flow traffic is sparse — historical baseline is ~22 cancellation_reasons / 30d → ~0.7/day → ~1-3 retention_offer_eligible events expected in 48h.
  4. Cloud Logging query in toby-production-286416:
    resource.labels.service_name="prod-api"
    AND jsonPayload.message="retention_offer_eligible"
    AND timestamp >= "<deploy timestamp>"
    
  5. Expectation: at least 1-3 events over 48h. If zero, that's also a useful signal — either the FE isn't calling /retention-offer after reason submit (Tier 2 question) or all callers are getting eligible:false (cooldown / age gate, which the existing handler already covers).
  6. Compute the FE funnel ratio over the same window: count(retention_offer_eligible events) vs count(cancellation_reasons.created_at) rows. This is the first datapoint we'll have to disambiguate "modal-renders-but-decline" from "modal-never-calls-the-endpoint" — i.e. the Tier 2c question above.
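The expected volume in step 3 and the ratio in step 6 are simple enough to compute by hand, but a small sketch makes the assumptions explicit. The function names are illustrative, traffic is assumed to be spread evenly over the baseline, and the counts passed to offerRate in main are placeholders rather than observed data.

```go
package main

import "fmt"

// expectedEvents scales a historical count over baselineDays to a window of
// windowHours, assuming traffic is spread evenly.
func expectedEvents(count int, baselineDays, windowHours float64) float64 {
	return float64(count) / baselineDays * windowHours / 24
}

// offerRate is the share of cancellation_reasons rows that were matched by a
// retention_offer_eligible log event in the same window.
func offerRate(eligibleEvents, reasons int) float64 {
	if reasons == 0 {
		return 0
	}
	return float64(eligibleEvents) / float64(reasons)
}

func main() {
	// 22 cancellation_reasons rows in 30 days, projected onto a 48h window.
	fmt.Printf("reasons expected in 48h: %.1f\n", expectedEvents(22, 30, 48))

	// Placeholder counts for the observation window, not real data.
	fmt.Printf("offer rate: %.2f\n", offerRate(1, 2))
}
```

A rate near 1 would suggest the modal calls the endpoint and users decline afterwards; a rate near 0 would point at the FE skipping the call or at eligible:false responses.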

After 14 days of Tier 1 telemetry flowing, file the Tier 2/Tier 3 tickets with real numbers attached.

Operator decisions to surface

  • Approve and merge the Tier 1 patch? Validator returned validated + medium specifically because the corrected diff needs a Go reviewer eyes-on before automated ship. The patch above is compile-ready by validator confirmation — needs one human review pass.
  • File Tier 2-4 as separate tickets now or wait for Tier 1 telemetry? Recommended: file the tickets now with stub bodies pointing to this incident; backfill numbers in 14 days when Tier 1 data flows. That way they don't get lost.
  • Adjust the ticket's success metric? "0 offers issued" is unmeasurable from retention_offers (cause #1, schema gap). If the team wants to measure "issued", that's Tier 3 work; don't grade Tier 1 against an unmeasurable metric.

Open questions

  • None blocking diagnosis or ship. Validator's compile-readiness objection is fully resolved by the corrected diff above.
  • Awaiting human approval for the Tier 1 PR. After approval, this can be re-routed through the fix-shipper on a future tick.

Citations

  • Frontend finding: artifacts/toby-frontend-doctor/c1bf20e9-d112-429a-817a-986e7a08ce2f/finding.md
  • Backend finding: artifacts/toby-backend-doctor/f8fd14fa-77ec-4906-8cbd-0dec5f88d26d/finding.md
  • Synthesis draft (preserved): artifacts/toby-incident-coordinator/889c2366-0fe8-45ee-afb0-d293f41bd015/synthesis-draft.md
  • Validator's verdict + corrected diff: artifacts/toby-incident-validator/db1a3c0a-b500-432d-a579-658f01657186/validation.md
  • Discernment audit (Wave 0 sweep): artifacts/toby-incident-coordinator/889c2366-0fe8-45ee-afb0-d293f41bd015/discernment-2026-05-12.md
  • Source ticket: TOBY-6 (id a4c30893-a56e-4b35-8a99-e462290abe15), priority urgent.
  • Prod-api revision (stable): prod-api-00427-9p2 @ SHA 4b0107858e706c904e6cf2841fbcbf81a1e2f94f since 2026-02-02.
  • DB connection (read-only spot-checks by validator): be55a66b-c905-4759-9ce1-a97785bb69e6.
  • Migration: apps/api/data/migrations/V71__retention_offers.up.sql:1-15.
  • Only insert site: apps/api/context/v3/subscription_context.go:697-710.
  • Cancel handler (no retention): apps/api/context/v3/subscription_context.go:75-151.
  • Eligibility predicate: apps/api/models/models/cancellation_reason.go:33-37.
  • FE cancel entry: apps/extension/app/components/Modal/OrgSettings/Subscription.tsx:228-240.
  • FE retention dispatch: apps/extension/app/components/Modal/Downgrade/CancelSubscription.tsx:643-709.
  • FE retention modal: apps/extension/app/components/Modal/Downgrade/RetentionOffer.tsx.

Timeline

| Time (UTC) | Event |
|------------|-------|
| 2026-02-02 | prod-api-00425 deployed with current SHA. No further code deploys to prod-api since. |
| 2026-03-31 | FE commit cbc92a78d widens retention eligibility (removes RETENTION_ELIGIBLE_REASONS filter). Expected to raise offer volume, not lower it. |
| 2026-04-09 | FE commit d68726b29 lands (the blank-extension-page regression). May have contributed to fewer users reaching Org Settings → Cancel; separate incident, already shipped/triaged. |
| 2026-05-11 22:34 | TOBY-6 filed by toby-state-of-business---nightly-report based on state-of-business-2026-05-18.html. |
| 2026-05-12 03:57 | Warroom (this run) opted in TOBY-6 via Wave 0 discernment sweep (urgent priority, no agent owner, cross-cutting). |
| 2026-05-12 04:03 | Both doctors converged: backend disconfirms hypothesis B (writes are healthy), frontend reframes hypothesis A (modal exists, but bypass paths dominate). |
| 2026-05-12 04:08 | Validator returned validated + medium. Diagnosis sound; Tier 1 patch needed a 5-min compile-readiness correction. |
| 2026-05-12 ~04:10 | This doc published. Status: closed (diagnosis); ship_state: awaiting_human_review. Source ticket TOBY-6 → in_review. |