
TOBY-6 — Backend finding

Hand-authored · 11 min read · 15 sections · Last edited May 13 by initial import

TL;DR

Hypothesis B (events not writing) is wrong. The accept-write path is live and working: POST /v3/teams/:id:admin/subscriptions/cancel/retention/accept wrote a row at 2026-05-12 23:17 UTC and returned 200. The 17 all-time rows are all real, consistent accepts.

Hypothesis A (FE never reaches save-offer step) is strongly supported. In the last 30 days only ONE retention/accept request hit prod-api at all. The funnel collapses before the backend gets a chance to write: 120 actual cancels → 22 cancellation-reason POSTs → 1 retention-accept POST. Two leak points:

  1. 120 → 22 (18% reach modal): ~98 cancels in 30d never POST a reason. Best explanation: Stripe Billing Portal direct cancels — users following invoice/billing-portal links cancel inside Stripe and only arrive at our system via the customer.subscription.updated webhook (which flips status='canceled'). They never touch our FE cancel modal. This is structurally outside the cancel handler's reach.
  2. 22 → 1 (4.5% accept): 21 users submitted a reason but did NOT accept any offer. Backend has zero visibility into whether the offer was displayed to them, declined, or never rendered — by design (see Schema gap below).

Defer to frontend for: does the FE call GET /retention-offer after every reason submit, and what does it do when eligible:true comes back?

Side finding (separate ticket-worthy): Five TOBY_RETENTION* secrets do NOT exist in GCP Secret Manager. Every cold start logs "failed to access secret version" for each. Config silently falls back to struct-tag defaults — system works, but the log spam is misleading and ops can't tune RetentionCooldownMonths / RetentionMinSubscriptionDays etc. without a redeploy.


Surface

  • Service: prod-api Cloud Run service, region us-east4, project toby-production-286416, revision prod-api-00427-9p2 (commit-sha 4b0107858e706c904e6cf2841fbcbf81a1e2f94f, deployed 2026-04-01, stable since 2026-02-02).
  • DB tables: retention_offers, cancellation_reasons, subscriptions (Toby Prod, connection be55a66b-c905-4759-9ce1-a97785bb69e6).
  • Code paths (all in apps/api/):
    • Migration: apps/api/data/migrations/V71__retention_offers.up.sql:1-15
    • Model: apps/api/models/models/retention_offer.go:8-17
    • DTOs: apps/api/models/dtos/retention_dtos.go:14-49
    • Routes registration: apps/api/context/v3/subscription_context.go:36-46
    • Cancellation reason write: apps/api/context/v3/subscription_context.go:508-572
    • Eligibility check (read-only): apps/api/context/v3/subscription_context.go:261-505
    • Retention accept write (the ONLY writer): apps/api/context/v3/subscription_context.go:611-727 — row insert at L697-710.
    • Cancel handler (does NOT touch retention_offers): apps/api/context/v3/subscription_context.go:75-151.
    • Eligibility predicate: apps/api/models/models/cancellation_reason.go:33-37 — returns true for any non-empty reason.
    • Config defaults: apps/api/config/config.go:185-192.
    • Secret loader (fail-soft): apps/api/config/gcp/gcp_processor.go:25-32.

Evidence

DB state (Toby Prod, read-only, today 2026-05-12)

| Query | Result |
| --- | --- |
| SELECT count(*) FROM retention_offers | 17 all-time |
| SELECT count(*) FROM retention_offers WHERE created_at > now() - interval '30 days' | 1 (the 2026-05-12 23:17 row) |
| SELECT count(*) FROM cancellation_reasons | 238 all-time |
| SELECT count(*) FROM cancellation_reasons WHERE created_at > now() - interval '30 days' | 22 |
| SELECT count(*) FROM subscriptions WHERE status IN ('canceled') AND updated_at > now() - interval '30 days' | 120 |
| SELECT count(*) FROM subscriptions WHERE cancel_at_period_end=true AND updated_at > now() - interval '30 days' | 108 |

Funnel (30d): 120 cancels (status flip) ⇨ 22 reason POSTs (18%) ⇨ 1 accept POST (0.83%).

Monthly trend (retention_offers vs cancellation_reasons):

| Month | reasons | offers accepted |
| --- | --- | --- |
| 2026-05 (partial) | 11 | 1 |
| 2026-04 | 24 | 2 |
| 2026-03 | 47 | 4 |
| 2026-02 | 134 | 10 |
| 2026-01 | 22 | 0 |

The accept-to-reason ratio has held at roughly 7–9% in every month with at least one accept (0 of 22 in 2026-01) and sits at 4.5% over the trailing 30 days. This is not a recent regression but a long-running funnel reality.
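The ratios above can be re-derived from the raw counts. The Go sketch below is illustrative only: the monthRow type and the pct and totals helpers are hypothetical names, and the per-month counts are read from the trend table. They sum to the all-time counts (238 reasons, 17 accepts) reported under DB state.

```go
package main

import "fmt"

// monthRow holds one row of the monthly trend table.
type monthRow struct {
	Month   string
	Reasons int
	Accepts int
}

// trend lists the monthly counts, newest first.
var trend = []monthRow{
	{Month: "2026-05 (partial)", Reasons: 11, Accepts: 1}, // 9.1%
	{Month: "2026-04", Reasons: 24, Accepts: 2},           // 8.3%
	{Month: "2026-03", Reasons: 47, Accepts: 4},           // 8.5%
	{Month: "2026-02", Reasons: 134, Accepts: 10},         // 7.5%
	{Month: "2026-01", Reasons: 22, Accepts: 0},           // 0.0%
}

// pct returns part as a percentage of whole, or 0 when whole is 0.
func pct(part, whole int) float64 {
	if whole == 0 {
		return 0
	}
	return 100 * float64(part) / float64(whole)
}

// totals sums reasons and accepts across all rows.
func totals(rows []monthRow) (reasons, accepts int) {
	for _, r := range rows {
		reasons += r.Reasons
		accepts += r.Accepts
	}
	return reasons, accepts
}

func main() {
	for _, r := range trend {
		fmt.Printf("%-18s %5.1f%%\n", r.Month, pct(r.Accepts, r.Reasons))
	}

	reasons, accepts := totals(trend)
	fmt.Println(reasons, accepts) // 238 17

	// 30-day funnel: 120 cancels, 22 reason POSTs, 1 accept POST.
	fmt.Printf("%.1f%% %.1f%% %.2f%%\n", pct(22, 120), pct(1, 22), pct(1, 120)) // 18.3% 4.5% 0.83%
}
```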

Schema reality (retention_offers)

id              uuid       NOT NULL  default uuid_generate_v4()
team_id         uuid       NOT NULL
user_id         uuid       NOT NULL
subscription_id uuid       NULL
coupon_id       text       NOT NULL
interval        text       NOT NULL
accepted_at     timestamptz NOT NULL default now()
created_at      timestamptz NOT NULL default now()

No status, no offered_at, no declined_at. Every row is an acceptance — see V71__retention_offers.up.sql:1-15 and the only insert site at subscription_context.go:697-710. The ticket's framing of "0 offers issued, 0 accepted" conflates two different things; from this table you can ONLY measure accepts.

All 17 rows (DESC by created_at)

  • retention_yearly / year — 2026-05-12 23:17 (the monthly→yearly switch path)
  • 16× retention_legacy / year — between 2026-02-02 and 2026-04-05, all from distinct teams.

Striking pattern: 16 of 17 all-time accepts are legacy users (coupon_id='retention_legacy'). The non-legacy retention_yearly row is exactly one and it's from today via the monthly→yearly switch branch (subscription_context.go:411-449). There are zero rows where a non-legacy yearly user accepted the retention_yearly discount-on-renewal branch (subscription_context.go:475-497). Either FE doesn't display that offer, or non-legacy yearly users universally decline — backend can't tell.

Live HTTP request log (Cloud Logging)

resource.labels.service_name="prod-api" + httpRequest.requestUrl:"retention-offer" returns only 2 hits going back to 2026-04-13 — both from the same team ce2cc1ac-…-90dfb on 2026-05-12. The full flow we observed for that team:

23:17:07.710  POST .../subscriptions/cancel/reason             → 200
23:17:07.861  GET  .../subscriptions/cancel/retention-offer    → 200 (649 B → eligible:true, offer attached)
23:17:15.189  POST .../subscriptions/cancel/retention/accept   → 200 (wrote the row)
23:17:32.517  POST .../subscriptions/cancel/reason             → 200 (second submit)
23:17:32.658  GET  .../subscriptions/cancel/retention-offer    → 200 (554 B → likely cooldown_active now)
23:17:32.828  POST .../subscriptions/cancel                    → 200 (Stripe portal URL)

This is the canonical happy-path AND demonstrates the cooldown check kicks in: same team retried 17 s later, got smaller payload (eligible:false), and was correctly routed to Stripe portal cancel. Backend behaved exactly as designed.

httpRequest.requestUrl:"/cancel" filter over the same window returned the same 2-team / single-day cluster of cancel flow traffic. There is no "wave" of users hitting the retention endpoints and silently failing — there is a near-total absence of users hitting them at all.

Cloud Run stdout logs

Filter severity>=ERROR over last 30 days mentioning "retention" produces only the spam below — no application errors from the handlers themselves:

failed to access secret version: rpc error: code = NotFound desc =
  Secret [projects/144082320709/secrets/TOBY_RETENTIONMINSUBSCRIPTIONDAYS] not found or has no versions.
failed to access secret version: rpc error: code = NotFound desc =
  Secret [projects/144082320709/secrets/TOBY_RETENTIONCOOLDOWNMONTHS] not found or has no versions.
failed to access secret version: rpc error: code = NotFound desc =
  Secret [projects/144082320709/secrets/TOBY_RETENTIONLEGACYYEARLYPRICE] not found or has no versions.
failed to access secret version: rpc error: code = NotFound desc =
  Secret [projects/144082320709/secrets/TOBY_RETENTIONCOUPONLEGACY] not found or has no versions.
failed to access secret version: rpc error: code = NotFound desc =
  Secret [projects/144082320709/secrets/TOBY_RETENTIONCOUPONYEARLY] not found or has no versions.

gcloud secrets list --filter=name~TOBY_RETENTION returns []. None of these secrets exist. Loader (gcp_processor.go:25-32) treats missing as ("", false) → envconfig falls back to struct-tag default (config.go:188-192):

| Field | Default used in prod |
| --- | --- |
| RetentionCouponYearly | "retention_yearly" |
| RetentionCouponLegacy | "retention_legacy" |
| RetentionLegacyYearlyPrice | 36.00 |
| RetentionCooldownMonths | 12 |
| RetentionMinSubscriptionDays | 30 |

These defaults match what's actually in the retention_offers rows (coupon_id is retention_yearly / retention_legacy), so the system is operating correctly — just on hard-coded defaults rather than Secret Manager values.
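The fail-soft behaviour described above can be sketched in a few lines of Go. This is not the code in gcp_processor.go or config.go: lookupFunc and intOrDefault are hypothetical names, and the real loader goes through envconfig struct tags rather than an explicit default argument. The sketch only shows why a missing secret produces a working value and no hard failure.

```go
package main

import (
	"fmt"
	"strconv"
)

// lookupFunc stands in for the Secret Manager loader. It reports ("", false)
// when the secret does not exist, mirroring the fail-soft contract described above.
type lookupFunc func(name string) (string, bool)

// intOrDefault returns the secret parsed as an int. When the secret is missing
// or unparsable it returns def, and the second result reports that the default was used.
func intOrDefault(lookup lookupFunc, name string, def int) (int, bool) {
	raw, ok := lookup(name)
	if !ok {
		return def, true
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return def, true
	}
	return n, false
}

func main() {
	// An empty store reproduces production, where no TOBY_RETENTION* secret exists.
	empty := func(string) (string, bool) { return "", false }

	months, usedDefault := intOrDefault(empty, "TOBY_RETENTIONCOOLDOWNMONTHS", 12)
	fmt.Println(months, usedDefault) // 12 true
}
```

Returning the "default used" flag is one way to log the fallback once at a lower severity, instead of emitting an ERROR per secret on every cold start.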

Cancel handler does not invoke retention

SubscriptionController.Cancel (subscription_context.go:75-151) goes straight to PaymentSvc.CancelSubscription(team.PaymentCustomerID, subscription.ProviderID) (L138) which returns a Stripe billing-portal URL. No retention check, no offer write, no skip-logged event. The retention flow is entirely FE-orchestrated; if the FE skips GET /retention-offer and goes straight to POST /cancel, the user is gone with zero backend signal.

Eligibility gates (no silent flag found)

validateRetentionEligibility (subscription_context.go:261-505) gates on:

  1. Subscription must exist + be active (no_active_subscription)
  2. Must have a cancellation_reasons row (invalid_reason)
  3. If yearly AND !IsEligibleForRetention(reason) → invalid_reason. BUT IsEligibleForRetention (cancellation_reason.go:33-37) returns true for any non-empty reason, so all four enum values (not_using, too_expensive, missing_features, other) pass.
  4. Subscription age ≥ RetentionMinSubscriptionDays (30d default) → else subscription_too_new.
  5. No prior retention_offers row within RetentionCooldownMonths (12mo default) → else cooldown_active.
  6. Legacy detection via feature flag cfgBase.legacy2 (subscription_context.go:294-312) — feature-flag failure is treated as "non-legacy" (fail-soft), so this can't be silently blocking legacy users.

Apart from the fail-soft legacy flag in gate 6, there is no feature flag, A/B gate, or kill-switch in the eligibility code that quietly returns eligible:false. The flow gates only on a missing subscription, a missing reason, subscription age, or cooldown, all of which are deterministic and visible.
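The gate order listed above can be written as a small decision function. The Go sketch below is a simplified model, not the body of validateRetentionEligibility: the snapshot type, its fields, firstBlockingGate, and the day helper are all hypothetical, and legacy detection (gate 6) is left out because it selects an offer rather than blocking one.

```go
package main

import (
	"fmt"
	"time"
)

// snapshot is a hypothetical, flattened view of the inputs the gates read.
type snapshot struct {
	HasActiveSub   bool
	Reason         string    // "" means no cancellation_reasons row
	Yearly         bool
	SubCreatedAt   time.Time
	LastAcceptedAt time.Time // zero value means no prior retention_offers row
}

// isEligibleForRetention mirrors the described predicate: any non-empty reason passes.
func isEligibleForRetention(reason string) bool { return reason != "" }

// day builds a UTC midnight timestamp (example helper).
func day(y, m, d int) time.Time {
	return time.Date(y, time.Month(m), d, 0, 0, 0, 0, time.UTC)
}

// firstBlockingGate returns "" when every gate passes, otherwise the code of
// the first gate that fails, checked in the order listed above.
func firstBlockingGate(s snapshot, now time.Time, minDays, cooldownMonths int) string {
	switch {
	case !s.HasActiveSub:
		return "no_active_subscription"
	case s.Reason == "":
		return "invalid_reason"
	case s.Yearly && !isEligibleForRetention(s.Reason):
		// Cannot fire once the previous gate has passed, because the
		// predicate accepts every non-empty reason.
		return "invalid_reason"
	case now.Sub(s.SubCreatedAt) < time.Duration(minDays)*24*time.Hour:
		return "subscription_too_new"
	case !s.LastAcceptedAt.IsZero() && now.Before(s.LastAcceptedAt.AddDate(0, cooldownMonths, 0)):
		return "cooldown_active"
	}
	return ""
}

func main() {
	now := time.Date(2026, 5, 12, 23, 17, 32, 0, time.UTC)
	s := snapshot{HasActiveSub: true, Reason: "too_expensive", Yearly: true, SubCreatedAt: day(2025, 5, 12)}

	fmt.Printf("%q\n", firstBlockingGate(s, now, 30, 12)) // ""

	// An accept 17 s earlier puts the team inside the 12-month cooldown.
	s.LastAcceptedAt = now.Add(-17 * time.Second)
	fmt.Println(firstBlockingGate(s, now, 30, 12)) // cooldown_active
}
```

With the documented defaults (30 days, 12 months), the second call reproduces the retry seen in the request log: the same team becomes ineligible immediately after accepting.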


Root-cause hypothesis (high confidence)

Two causes, both non-backend:

  1. Most cancels bypass our FE flow entirely. The 120-vs-22 gap (~98 cancels in 30d without a cancellation_reasons row) is best explained by users cancelling through Stripe Billing Portal directly — from invoice emails, billing.stripe.com links, or after POST /cancel redirects them to the portal and they confirm there. Stripe webhooks (customer.subscription.updated) then flip subscriptions.status to canceled. The FE retention modal never has a chance to run. Fix is product/FE — not backend (intercept cancel intent earlier, before the Stripe-portal redirect).

  2. Of users who do submit a reason, the FE either isn't surfacing the offer or users decline at ~95%. Backend has no telemetry to disambiguate. Looking at the log evidence from team ce2cc1ac on 2026-05-12, the FE does call GET /retention-offer and does call POST /retention/accept when the user accepts — but we have only ONE such observed sequence in 12+ days of logs. Most "post-reason" sessions appear to never reach the GET. Confirmation needed from FE — see defer_to below.

Schema gap: issued vs accepted

The most important schema fact for this ticket:

retention_offers only records ACCEPTS. There is no row for "offer was displayed and the user declined", no row for "offer was eligible but the FE never rendered it", and no row when GET /retention-offer returns eligible:true.

To answer the business question "how many offers are being shown vs accepted?", we need one of:

  • (Backend) Log "offer eligible" events: emit a structured log line at subscription_context.go:599-607 whenever result.Eligible == true. Cheap, no schema change.
  • (Backend) Add retention_offer_views table or status column: write a row at GetRetentionOffer time with status='offered', update to status='accepted' on accept. More work, but proper analytics.
  • (FE) Mixpanel/Segment event when the offer UI is displayed, declined, or accepted. Probably the right surface — the FE already knows when it renders the offer, and this is a product-funnel question.

Until one of these exists, the ticket's metric "offers issued" cannot be measured at all from retention_offers — full stop.
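As an illustration of the first option, a structured "offer eligible" log line might look like the Go sketch below. It assumes the standard library log/slog package (Go 1.21+); the logger actually used in apps/api is not known from this investigation. The event name, field keys, offer type value, and the offerEligibleAttrs helper are hypothetical.

```go
package main

import (
	"log/slog"
	"os"
)

// offerEligibleAttrs builds the fields for a hypothetical
// "retention_offer_eligible" event.
func offerEligibleAttrs(teamID, userID, offerType, couponID string) []any {
	return []any{
		slog.String("team_id", teamID),
		slog.String("user_id", userID),
		slog.String("offer_type", offerType),
		slog.String("coupon_id", couponID),
	}
}

func main() {
	// JSON on stdout so each field can be filtered individually in the log backend.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Placeholder values; in the handler these would come from the eligibility result.
	logger.Info("retention_offer_eligible",
		offerEligibleAttrs("team-123", "user-456", "yearly_discount", "retention_yearly")...)
}
```

Counting these lines per day against rows in retention_offers would give an eligible-to-accepted ratio without a schema change, although it still cannot tell whether the FE rendered the offer.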


Defer to frontend

Backend has done its part — the write path is healthy, eligibility is permissive, and the cooldown/age gates are working correctly when they fire. To close TOBY-6 we need the FE doctor to answer:

  1. Funnel A: For the ~98 cancels in 30d that never POST /cancel/reason, are they hitting the FE cancel modal at all, or are they cancelling via Stripe portal directly? Does our marketing/billing-email flow even funnel users through our modal first?
  2. Funnel B: For the 21 users in 30d who POSTed /cancel/reason but did NOT POST /retention/accept, did the FE then call GET /retention-offer? Did it receive eligible:true? Was the offer UI rendered? Did the user click "no thanks"?
  3. Non-legacy yearly users: of the cancellation_reasons in 30d, how many were on non-legacy yearly subs? Were any of them shown the retention_yearly discount-on-renewal offer (subscription_context.go:475-497) and did they decline universally?

| # | Owner | Action |
| --- | --- | --- |
| 1 | FE doctor | Answer the three funnel questions above. Confirm whether the modal pipeline calls GET /retention-offer after POST /cancel/reason in 100% of sessions, and what the FE does with eligible:true. |
| 2 | Backend (this agent / Toby team) | Add a structured log line in GetRetentionOffer when result.Eligible == true, including teamID, userID, OfferType, CouponID. Pulls one bit of visibility into the funnel without a schema change. ~10 LOC patch in subscription_context.go:599-607. |
| 3 | Backend / DevOps | Separate ticket: create the 5 missing TOBY_RETENTION* secrets in Secret Manager (or rip out the secret-lookup for these and document the defaults). Either way, kill the cold-start log spam; it's misleading on-call signal. |
| 4 | Product | Decide whether to add a retention_offers.status column (or a sibling retention_offer_views table) so "issued" is measurable. Until then, the ticket's metric definition is unanswerable from this schema. |

Open items / unknowns

  • Why 134 cancellation_reasons in Feb 2026 but only 24 in Apr 2026 (22 in the trailing 30 days)? The reason-submit rate has fallen ~6× over 3 months. Either active subscriber churn is genuinely falling, OR the FE flow that captures reason is increasingly being skipped. Worth a separate look. (Backend evidence is healthy in both periods: no deploys to apps/api since 2026-02-02.)
  • Stripe portal direct cancels — confirmed? I'm inferring this from the 120-vs-22 gap. Confirming requires either Stripe Sigma access or correlating the subscriptions.updated_at with whether a matching cancellation_reasons row exists in a ~5min window. Worth doing but not strictly required to defer this to FE.