TOBY-6: Backend finding
TL;DR
Hypothesis B (events not writing) is wrong. The accept-write path is live and working — POST /v3/teams/:id:admin/subscriptions/cancel/retention/accept wrote a row 23 hours ago (2026-05-12 23:17 UTC) and returned 200. The 17 all-time rows are all real, consistent accepts.
Hypothesis A (FE never reaches save-offer step) is strongly supported. In the last 30 days only ONE retention/accept request hit prod-api at all. The funnel collapses before the backend gets a chance to write: 120 actual cancels → 22 cancellation-reason POSTs → 1 retention-accept POST. Two leak points:
- 120 → 22 (18% reach modal): ~98 cancels in 30d never POST a reason. Best explanation: Stripe Billing Portal direct cancels. Users following invoice/billing-portal links cancel inside Stripe and only arrive at our system via the `customer.subscription.updated` webhook (which flips `status='canceled'`). They never touch our FE cancel modal. This is structurally outside the cancel handler's reach.
- 22 → 1 (4.5% accept): 21 users submitted a reason but did NOT accept any offer. Backend has zero visibility into whether the offer was displayed to them, declined, or never rendered. This is by design (see Schema gap below).
Defer to frontend for: does the FE call GET /retention-offer after every reason submit, and what does it do when eligible:true comes back?
Side finding (separate ticket-worthy): Five TOBY_RETENTION* secrets do NOT exist in GCP Secret Manager. Every cold start logs "failed to access secret version" for each. Config silently falls back to struct-tag defaults — system works, but the log spam is misleading and ops can't tune RetentionCooldownMonths / RetentionMinSubscriptionDays etc. without a redeploy.
Surface
- Service: `prod-api` Cloud Run service, region `us-east4`, project `toby-production-286416`, revision `prod-api-00427-9p2` (commit-sha `4b0107858e706c904e6cf2841fbcbf81a1e2f94f`, deployed 2026-04-01, stable since 2026-02-02).
- DB tables: `retention_offers`, `cancellation_reasons`, `subscriptions` (Toby Prod, connection `be55a66b-c905-4759-9ce1-a97785bb69e6`).
- Code paths (all in `apps/api/`):
  - Migration: `apps/api/data/migrations/V71__retention_offers.up.sql:1-15`
  - Model: `apps/api/models/models/retention_offer.go:8-17`
  - DTOs: `apps/api/models/dtos/retention_dtos.go:14-49`
  - Routes registration: `apps/api/context/v3/subscription_context.go:36-46`
  - Cancellation reason write: `apps/api/context/v3/subscription_context.go:508-572`
  - Eligibility check (read-only): `apps/api/context/v3/subscription_context.go:261-505`
  - Retention accept write (the ONLY writer): `apps/api/context/v3/subscription_context.go:611-727`, row insert at L697-710.
  - Cancel handler (does NOT touch retention_offers): `apps/api/context/v3/subscription_context.go:75-151`.
  - Eligibility predicate: `apps/api/models/models/cancellation_reason.go:33-37`, returns true for any non-empty reason.
  - Config defaults: `apps/api/config/config.go:185-192`.
  - Secret loader (fail-soft): `apps/api/config/gcp/gcp_processor.go:25-32`.
Evidence
DB state (Toby Prod, read-only, today 2026-05-12)
| Query | Result |
|---|---|
| `SELECT count(*) FROM retention_offers` | 17 all-time |
| `SELECT count(*) FROM retention_offers WHERE created_at > now() - interval '30 days'` | 1 (the 2026-05-12 23:17 row) |
| `SELECT count(*) FROM cancellation_reasons` | 238 all-time |
| `SELECT count(*) FROM cancellation_reasons WHERE created_at > now() - interval '30 days'` | 22 |
| `SELECT count(*) FROM subscriptions WHERE status IN ('canceled') AND updated_at > now() - interval '30 days'` | 120 |
| `SELECT count(*) FROM subscriptions WHERE cancel_at_period_end=true AND updated_at > now() - interval '30 days'` | 108 |
Funnel (30d): 120 cancels (status flip) ⇨ 22 reason POSTs (18%) ⇨ 1 accept POST (0.83%).
Monthly trend (retention_offers vs cancellation_reasons):
| Month | reasons | offers accepted |
|---|---|---|
| 2026-05 (partial) | 11 | 1 |
| 2026-04 | 24 | 2 |
| 2026-03 | 47 | 4 |
| 2026-02 | 134 | 10 |
| 2026-01 | 22 | 0 |
The accept-to-reason ratio has stayed between roughly 4% and 10% since February (January shows zero accepts), even in healthier months, so this is not a recent regression but a long-running funnel reality.
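As a sanity check, the funnel percentages and the monthly ratios can be re-derived from the counts in the tables above. A minimal Go sketch; `pct` and `monthRow` are illustrative names, not code from `apps/api/`:

```go
package main

import "fmt"

// pct returns part/whole as a percentage, or 0 when whole is 0.
func pct(part, whole int) float64 {
	if whole == 0 {
		return 0
	}
	return 100 * float64(part) / float64(whole)
}

// monthRow mirrors one row of the monthly trend table (hypothetical type).
type monthRow struct {
	Month            string
	Reasons, Accepts int
}

func main() {
	// 30-day counts copied from the DB state table.
	cancels, reasons, accepts := 120, 22, 1
	fmt.Printf("cancel -> reason: %.1f%%\n", pct(reasons, cancels))
	fmt.Printf("reason -> accept: %.1f%%\n", pct(accepts, reasons))
	fmt.Printf("cancel -> accept: %.2f%%\n", pct(accepts, cancels))

	// Monthly counts copied from the trend table.
	months := []monthRow{
		{"2026-05 (partial)", 11, 1},
		{"2026-04", 24, 2},
		{"2026-03", 47, 4},
		{"2026-02", 134, 10},
		{"2026-01", 22, 0},
	}
	for _, m := range months {
		fmt.Printf("%s: %.1f%% accept-to-reason\n", m.Month, pct(m.Accepts, m.Reasons))
	}
}
```

With these inputs the monthly ratio lands between about 7.5% and 9.1% for February through May and is 0% for January; the 4.5% figure is the trailing-30-day window.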
Schema reality (retention_offers)
id uuid NOT NULL default uuid_generate_v4()
team_id uuid NOT NULL
user_id uuid NOT NULL
subscription_id uuid NULL
coupon_id text NOT NULL
interval text NOT NULL
accepted_at timestamptz NOT NULL default now()
created_at timestamptz NOT NULL default now()
No status, no offered_at, no declined_at. Every row is an acceptance — see V71__retention_offers.up.sql:1-15 and the only insert site at subscription_context.go:697-710. The ticket's framing of "0 offers issued, 0 accepted" conflates two different things; from this table you can ONLY measure accepts.
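For reference, a hypothetical Go struct mirroring the column list above. This is not a copy of the model at retention_offer.go:8-17, whose field names and tags may differ; IDs are plain strings here to avoid assuming a UUID library. The point is that nothing in the row can represent "offered" or "declined".

```go
package main

import (
	"fmt"
	"time"
)

// retentionOfferRow is a hypothetical mirror of the retention_offers columns.
// There is deliberately no Status, OfferedAt, or DeclinedAt field.
type retentionOfferRow struct {
	ID             string  // uuid, NOT NULL
	TeamID         string  // uuid, NOT NULL
	UserID         string  // uuid, NOT NULL
	SubscriptionID *string // uuid, NULL allowed
	CouponID       string  // e.g. "retention_yearly" or "retention_legacy"
	Interval       string  // e.g. "year"
	AcceptedAt     time.Time
	CreatedAt      time.Time
}

// hasSubscription reports whether subscription_id is non-NULL.
func (r retentionOfferRow) hasSubscription() bool {
	return r.SubscriptionID != nil
}

func main() {
	at := time.Date(2026, 5, 12, 23, 17, 15, 0, time.UTC)
	row := retentionOfferRow{
		ID:         "placeholder-offer-id", // illustrative value only
		TeamID:     "placeholder-team-id",
		UserID:     "placeholder-user-id",
		CouponID:   "retention_yearly",
		Interval:   "year",
		AcceptedAt: at,
		CreatedAt:  at,
	}
	fmt.Printf("coupon=%s interval=%s has_subscription=%v\n",
		row.CouponID, row.Interval, row.hasSubscription())
}
```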
All 17 rows (DESC by created_at)
- 1× `retention_yearly` / `year`: 2026-05-12 23:17 (the monthly→yearly switch path)
- 16× `retention_legacy` / `year`: between 2026-02-02 and 2026-04-05, all from distinct teams.
Striking pattern: 16 of 17 all-time accepts are legacy users (coupon_id='retention_legacy'). The non-legacy retention_yearly row is exactly one and it's from today via the monthly→yearly switch branch (subscription_context.go:411-449). There are zero rows where a non-legacy yearly user accepted the retention_yearly discount-on-renewal branch (subscription_context.go:475-497). Either FE doesn't display that offer, or non-legacy yearly users universally decline — backend can't tell.
Live HTTP request log (Cloud Logging)
resource.labels.service_name="prod-api" + httpRequest.requestUrl:"retention-offer" returns only 2 hits going back to 2026-04-13 — both from the same team ce2cc1ac-…-90dfb on 2026-05-12. The full flow we observed for that team:
23:17:07.710 POST .../subscriptions/cancel/reason → 200
23:17:07.861 GET .../subscriptions/cancel/retention-offer → 200 (649 B → eligible:true, offer attached)
23:17:15.189 POST .../subscriptions/cancel/retention/accept → 200 (wrote the row)
23:17:32.517 POST .../subscriptions/cancel/reason → 200 (second submit)
23:17:32.658 GET .../subscriptions/cancel/retention-offer → 200 (554 B → likely cooldown_active now)
23:17:32.828 POST .../subscriptions/cancel → 200 (Stripe portal URL)
This is the canonical happy-path AND demonstrates the cooldown check kicks in: same team retried 17 s later, got smaller payload (eligible:false), and was correctly routed to Stripe portal cancel. Backend behaved exactly as designed.
httpRequest.requestUrl:"/cancel" filter over the same window returned the same 2-team / single-day cluster of cancel flow traffic. There is no "wave" of users hitting the retention endpoints and silently failing — there is a near-total absence of users hitting them at all.
Cloud Run stdout logs
Filter severity>=ERROR over last 30 days mentioning "retention" produces only the spam below — no application errors from the handlers themselves:
failed to access secret version: rpc error: code = NotFound desc =
Secret [projects/144082320709/secrets/TOBY_RETENTIONMINSUBSCRIPTIONDAYS] not found or has no versions.
failed to access secret version: rpc error: code = NotFound desc =
Secret [projects/144082320709/secrets/TOBY_RETENTIONCOOLDOWNMONTHS] not found or has no versions.
failed to access secret version: rpc error: code = NotFound desc =
Secret [projects/144082320709/secrets/TOBY_RETENTIONLEGACYYEARLYPRICE] not found or has no versions.
failed to access secret version: rpc error: code = NotFound desc =
Secret [projects/144082320709/secrets/TOBY_RETENTIONCOUPONLEGACY] not found or has no versions.
failed to access secret version: rpc error: code = NotFound desc =
Secret [projects/144082320709/secrets/TOBY_RETENTIONCOUPONYEARLY] not found or has no versions.
gcloud secrets list --filter=name~TOBY_RETENTION returns []. None of these secrets exist. Loader (gcp_processor.go:25-32) treats missing as ("", false) → envconfig falls back to struct-tag default (config.go:188-192):
| Field | Default used in prod |
|---|---|
| `RetentionCouponYearly` | `"retention_yearly"` |
| `RetentionCouponLegacy` | `"retention_legacy"` |
| `RetentionLegacyYearlyPrice` | `36.00` |
| `RetentionCooldownMonths` | `12` |
| `RetentionMinSubscriptionDays` | `30` |
These defaults match what's actually in the retention_offers rows (coupon_id is retention_yearly / retention_legacy), so the system is operating correctly — just on hard-coded defaults rather than Secret Manager values.
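The fail-soft fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in gcp_processor.go or config.go: `secretLookup` and `resolve` are hypothetical names, and the real loader goes through envconfig struct tags rather than an explicit call.

```go
package main

import "fmt"

// secretLookup is a hypothetical stand-in for the Secret Manager client.
// It returns ("", false) when the secret is missing, mirroring the
// fail-soft behaviour described above.
type secretLookup func(name string) (string, bool)

// resolve returns the secret value when present, otherwise the default,
// and reports which source was used so a caller could log it once.
func resolve(lookup secretLookup, name, def string) (value string, fromSecret bool) {
	if v, ok := lookup(name); ok && v != "" {
		return v, true
	}
	return def, false
}

func main() {
	// Simulates prod today: none of the TOBY_RETENTION* secrets exist.
	missing := func(string) (string, bool) { return "", false }

	settings := []struct{ Secret, Default string }{
		{"TOBY_RETENTIONCOUPONYEARLY", "retention_yearly"},
		{"TOBY_RETENTIONCOUPONLEGACY", "retention_legacy"},
		{"TOBY_RETENTIONLEGACYYEARLYPRICE", "36.00"},
		{"TOBY_RETENTIONCOOLDOWNMONTHS", "12"},
		{"TOBY_RETENTIONMINSUBSCRIPTIONDAYS", "30"},
	}
	for _, s := range settings {
		v, fromSecret := resolve(missing, s.Secret, s.Default)
		fmt.Printf("%s=%s (from secret: %v)\n", s.Secret, v, fromSecret)
	}
}
```

Returning the source alongside the value is a design choice for the sketch: it would let the loader emit one informational line per defaulted setting instead of an ERROR per cold start.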
Cancel handler does not invoke retention
SubscriptionController.Cancel (subscription_context.go:75-151) goes straight to PaymentSvc.CancelSubscription(team.PaymentCustomerID, subscription.ProviderID) (L138) which returns a Stripe billing-portal URL. No retention check, no offer write, no skip-logged event. The retention flow is entirely FE-orchestrated; if the FE skips GET /retention-offer and goes straight to POST /cancel, the user is gone with zero backend signal.
Eligibility gates (no silent flag found)
validateRetentionEligibility (subscription_context.go:261-505) gates on:
- Subscription must exist + be active (`no_active_subscription`)
- Must have a `cancellation_reasons` row (`invalid_reason`)
- If yearly AND `!IsEligibleForRetention(reason)` → `invalid_reason`. BUT `IsEligibleForRetention` (`cancellation_reason.go:33-37`) returns true for any non-empty reason: all four enum values (`not_using`, `too_expensive`, `missing_features`, `other`) pass.
- Subscription age ≥ `RetentionMinSubscriptionDays` (30d default) → else `subscription_too_new`.
- No prior retention_offers row within `RetentionCooldownMonths` (12mo default) → else `cooldown_active`.
- Legacy detection via feature flag `cfgBase.legacy2` (`subscription_context.go:294-312`): feature-flag failure is treated as "non-legacy" (fail-soft), so this can't be silently blocking legacy users.
There is no feature flag, A/B gate, or kill-switch hidden in the eligibility code that's quietly returning eligible:false. The flow either gates on subscription age, cooldown, or missing-reason — all of which are deterministic and visible.
Root-cause hypothesis (high confidence)
Two causes, both non-backend:
- Most cancels bypass our FE flow entirely. The 120-vs-22 gap (~98 cancels in 30d without a cancellation_reasons row) is best explained by users cancelling through Stripe Billing Portal directly: from invoice emails, billing.stripe.com links, or after `POST /cancel` redirects them to the portal and they confirm there. Stripe webhooks (`customer.subscription.updated`) then flip `subscriptions.status` to `canceled`. The FE retention modal never has a chance to run. The fix is product/FE, not backend (intercept cancel intent earlier, before the Stripe-portal redirect).
- Of users who do submit a reason, the FE either isn't surfacing the offer or users decline at ~95%. Backend has no telemetry to disambiguate. Looking at the log evidence from team `ce2cc1ac` on 2026-05-12, the FE does call `GET /retention-offer` and does call `POST /retention/accept` when the user accepts, but we have only ONE such observed sequence in the log window searched (back to 2026-04-13). Most "post-reason" sessions appear to never reach the GET. Confirmation needed from FE; see "Defer to frontend" below.
Schema gap: issued vs accepted
The most important schema fact for this ticket:
`retention_offers` only records ACCEPTS. There is no row for "offer was displayed and the user declined", no row for "offer was eligible but the FE never rendered it", and no row when `GET /retention-offer` returns `eligible:true`.
To answer the business question "how many offers are being shown vs accepted?", we need one of:
- (Backend) Log "offer eligible" events: emit a structured log line at `subscription_context.go:599-607` whenever `result.Eligible == true`. Cheap, no schema change.
- (Backend) Add `retention_offer_views` table or `status` column: write a row at `GetRetentionOffer` time with `status='offered'`, update to `status='accepted'` on accept. More work, but proper analytics.
- (FE) Mixpanel/Segment event when the offer UI is displayed, declined, or accepted. Probably the right surface: the FE already knows when it renders the offer, and this is a product-funnel question.
Until one of these exists, the ticket's metric "offers issued" cannot be measured at all from retention_offers — full stop.
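The cheapest option above, a structured "offer eligible" log line, might look something like this. It is a sketch only: it assumes Go 1.21+ for the standard `log/slog` package, the service's actual logger may be different, and the event name, field keys, and the `offerEligibleEvent` type are hypothetical.

```go
package main

import (
	"bytes"
	"log/slog"
	"os"
)

// offerEligibleEvent holds the fields suggested for the log line
// (hypothetical type; values would come from the eligibility result).
type offerEligibleEvent struct {
	TeamID, UserID, OfferType, CouponID string
}

// logOfferEligible emits one structured line per eligible offer.
func logOfferEligible(l *slog.Logger, e offerEligibleEvent) {
	l.Info("retention_offer_eligible",
		slog.String("team_id", e.TeamID),
		slog.String("user_id", e.UserID),
		slog.String("offer_type", e.OfferType),
		slog.String("coupon_id", e.CouponID),
	)
}

// renderOfferEligible returns the JSON line that would be logged,
// which makes the output easy to inspect in a test.
func renderOfferEligible(e offerEligibleEvent) string {
	var buf bytes.Buffer
	logOfferEligible(slog.New(slog.NewJSONHandler(&buf, nil)), e)
	return buf.String()
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	logOfferEligible(logger, offerEligibleEvent{
		TeamID:    "placeholder-team-id", // illustrative values only
		UserID:    "placeholder-user-id",
		OfferType: "yearly_discount",
		CouponID:  "retention_yearly",
	})
}
```

A line like this could be counted in Cloud Logging against the accept rows, giving an "eligible vs accepted" ratio without a migration. It still would not show whether the FE rendered the offer.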
Defer to frontend
Backend has done its part — the write path is healthy, eligibility is permissive, and the cooldown/age gates are working correctly when they fire. To close TOBY-6 we need the FE doctor to answer:
- Funnel A: For the ~98 cancels in 30d that never POST `/cancel/reason`, are they hitting the FE cancel modal at all, or are they cancelling via Stripe portal directly? Does our marketing/billing-email flow even funnel users through our modal first?
- Funnel B: For the 21 users in 30d who POSTed `/cancel/reason` but did NOT POST `/retention/accept`, did the FE then call `GET /retention-offer`? Did it receive `eligible:true`? Was the offer UI rendered? Did the user click "no thanks"?
- Non-legacy yearly users: of the cancellation_reasons in 30d, how many were on non-legacy yearly subs? Were any of them shown the `retention_yearly` discount-on-renewal offer (`subscription_context.go:475-497`) and did they decline universally?
Recommended actions
| # | Owner | Action |
|---|---|---|
| 1 | FE doctor | Answer the three funnel questions above. Confirm whether the modal pipeline calls GET /retention-offer after POST /cancel/reason in 100% of sessions, and what the FE does with eligible:true. |
| 2 | Backend (this agent / Toby team) | Add a structured log line in GetRetentionOffer when result.Eligible == true, including teamID, userID, OfferType, CouponID. Pulls one bit of visibility into the funnel without a schema change. ~10 LOC patch in subscription_context.go:599-607. |
| 3 | Backend / DevOps | Separate ticket: create the 5 missing TOBY_RETENTION* secrets in Secret Manager (or rip out the secret-lookup for these and document the defaults). Either way, kill the cold-start log spam — it's misleading on-call signal. |
| 4 | Product | Decide whether to add a retention_offers.status column (or a sibling retention_offer_views table) so "issued" is measurable. Until then, the ticket's metric definition is unanswerable from this schema. |
Open items / unknowns
- Why 134 cancellation_reasons in Feb 2026 but only 24 in Apr 2026 (22 in the last 30 days)? The reason-submit rate has fallen ~6× over 3 months. Either active subscriber churn is genuinely falling, OR the FE flow that captures reason is increasingly being skipped. Worth a separate look. (Backend evidence is healthy in both periods: no deploys to `apps/api` since 2026-02-02.)
- Stripe portal direct cancels, confirmed? I'm inferring this from the 120-vs-22 gap. Confirming requires either Stripe Sigma access or correlating `subscriptions.updated_at` with whether a matching `cancellation_reasons` row exists in a ~5min window. Worth doing but not strictly required to defer this to FE.