---
ticket: TOBY-6
slug: retention-offers-silent
author: toby-backend-doctor
runId: f8fd14fa-77ec-4906-8cbd-0dec5f88d26d
date: 2026-05-12
verdict: defer_to_frontend
hypothesis_b_disconfirmed: true
confidence: high
---

# TOBY-6 — Backend finding

## TL;DR

**Hypothesis B (events not writing) is wrong.** The accept-write path is live and working: `POST /v3/teams/:id:admin/subscriptions/cancel/retention/accept` wrote a row today (2026-05-12 23:17 UTC) and returned 200. The 17 all-time rows are all real, consistent accepts.

**Hypothesis A (FE never reaches save-offer step) is strongly supported.** In the last 30 days only ONE `retention/accept` request hit prod-api at all. The funnel collapses *before* the backend gets a chance to write: 120 actual cancels → 22 cancellation-reason POSTs → 1 retention-accept POST. Two leak points:

1. **120 → 22 (18% reach modal)**: ~98 cancels in 30d never POST a reason. Best explanation: **Stripe Billing Portal direct cancels** — users following invoice/billing-portal links cancel inside Stripe and only arrive at our system via the `customer.subscription.updated` webhook (which flips `status='canceled'`). They never touch our FE cancel modal. This is structurally outside the cancel handler's reach.
2. **22 → 1 (4.5% accept)**: 21 users submitted a reason but did NOT accept any offer. Backend has zero visibility into whether the offer was *displayed* to them, declined, or never rendered — by design (see [Schema gap](#schema-gap-issued-vs-accepted) below).

**Defer to frontend** for: does the FE call `GET /retention-offer` after every reason submit, and what does it do when `eligible:true` comes back?

**Side finding (separate ticket-worthy)**: Five `TOBY_RETENTION*` secrets do NOT exist in GCP Secret Manager. Every cold start logs "failed to access secret version" for each. Config silently falls back to struct-tag defaults — system works, but the log spam is misleading and ops can't tune `RetentionCooldownMonths` / `RetentionMinSubscriptionDays` etc. without a redeploy.

---

## Surface

- **Service**: `prod-api` Cloud Run service, region `us-east4`, project `toby-production-286416`, revision `prod-api-00427-9p2` (commit-sha `4b0107858e706c904e6cf2841fbcbf81a1e2f94f`; revision deployed 2026-04-01, `apps/api` stable since 2026-02-02).
- **DB tables**: `retention_offers`, `cancellation_reasons`, `subscriptions` (Toby Prod, connection `be55a66b-c905-4759-9ce1-a97785bb69e6`).
- **Code paths** (all in `apps/api/`):
  - Migration: `apps/api/data/migrations/V71__retention_offers.up.sql:1-15`
  - Model: `apps/api/models/models/retention_offer.go:8-17`
  - DTOs: `apps/api/models/dtos/retention_dtos.go:14-49`
  - Routes registration: `apps/api/context/v3/subscription_context.go:36-46`
  - Cancellation reason write: `apps/api/context/v3/subscription_context.go:508-572`
  - Eligibility check (read-only): `apps/api/context/v3/subscription_context.go:261-505`
  - **Retention accept write** (the ONLY writer): `apps/api/context/v3/subscription_context.go:611-727` — row insert at L697-710.
  - Cancel handler (does NOT touch retention_offers): `apps/api/context/v3/subscription_context.go:75-151`.
  - Eligibility predicate: `apps/api/models/models/cancellation_reason.go:33-37` — returns true for any non-empty reason.
  - Config defaults: `apps/api/config/config.go:185-192`.
  - Secret loader (fail-soft): `apps/api/config/gcp/gcp_processor.go:25-32`.

---

## Evidence

### DB state (Toby Prod, read-only, today 2026-05-12)

| Query | Result |
|---|---|
| `SELECT count(*) FROM retention_offers` | **17** all-time |
| `SELECT count(*) FROM retention_offers WHERE created_at > now() - interval '30 days'` | **1** (the 2026-05-12 23:17 row) |
| `SELECT count(*) FROM cancellation_reasons` | **238** all-time |
| `SELECT count(*) FROM cancellation_reasons WHERE created_at > now() - interval '30 days'` | **22** |
| `SELECT count(*) FROM subscriptions WHERE status IN ('canceled') AND updated_at > now() - interval '30 days'` | **120** |
| `SELECT count(*) FROM subscriptions WHERE cancel_at_period_end=true AND updated_at > now() - interval '30 days'` | **108** |

**Funnel (30d)**: 120 cancels (status flip) ⇨ 22 reason POSTs (18% of cancels) ⇨ 1 accept POST (4.5% of reasons, 0.83% of cancels).

**Monthly trend (retention_offers vs cancellation_reasons)**:

| Month | reasons | offers accepted |
|---|---|---|
| 2026-05 (partial) | 11 | 1 |
| 2026-04 | 24 | 2 |
| 2026-03 | 47 | 4 |
| 2026-02 | 134 | 10 |
| 2026-01 | 22 | 0 |

The accept-to-reason ratio has held at roughly 7.5–9% in every month since February (10/134, 4/47, 2/24, 1/11), with zero accepts against 22 reasons in January. This is not a recent regression but a long-running funnel reality.
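The funnel and monthly percentages can be reproduced directly from the counts in the tables above. The following is a minimal sketch; the type and function names (`monthCount`, `rate`) are illustrative and not taken from the codebase.

```go
package main

import "fmt"

// monthCount holds one row of the monthly trend table (illustrative type).
type monthCount struct {
	Month   string
	Reasons int
	Accepts int
}

// rate returns part/whole as a percentage, or 0 when whole is 0.
func rate(part, whole int) float64 {
	if whole == 0 {
		return 0
	}
	return 100 * float64(part) / float64(whole)
}

func main() {
	// 30-day funnel counts from the DB state table.
	cancels, reasons, accepts := 120, 22, 1
	fmt.Printf("reason/cancel: %.1f%%\n", rate(reasons, cancels))
	fmt.Printf("accept/reason: %.1f%%\n", rate(accepts, reasons))
	fmt.Printf("accept/cancel: %.2f%%\n", rate(accepts, cancels))

	// Monthly counts from the trend table.
	months := []monthCount{
		{"2026-05 (partial)", 11, 1},
		{"2026-04", 24, 2},
		{"2026-03", 47, 4},
		{"2026-02", 134, 10},
		{"2026-01", 22, 0},
	}
	for _, m := range months {
		fmt.Printf("%-18s %.1f%%\n", m.Month, rate(m.Accepts, m.Reasons))
	}
}
```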

### Schema reality (`retention_offers`)

```text
id              uuid       NOT NULL  default uuid_generate_v4()
team_id         uuid       NOT NULL
user_id         uuid       NOT NULL
subscription_id uuid       NULL
coupon_id       text       NOT NULL
interval        text       NOT NULL
accepted_at     timestamptz NOT NULL default now()
created_at      timestamptz NOT NULL default now()
```

**No `status`, no `offered_at`, no `declined_at`.** Every row is an acceptance — see `V71__retention_offers.up.sql:1-15` and the only insert site at `subscription_context.go:697-710`. The ticket's framing of "0 offers issued, 0 accepted" conflates two different things; from this table you can ONLY measure accepts.

### All 17 rows (DESC by created_at)

- **1×** `retention_yearly` / `year` — 2026-05-12 23:17 (the monthly→yearly switch path)
- **16×** `retention_legacy` / `year` — between 2026-02-02 and 2026-04-05, all from distinct teams.

**Striking pattern**: 16 of 17 all-time accepts are legacy users (`coupon_id='retention_legacy'`). The single non-legacy `retention_yearly` row was written today via the monthly→yearly switch branch (subscription_context.go:411-449). There are **zero rows** where a non-legacy *yearly* user accepted the `retention_yearly` discount-on-renewal branch (subscription_context.go:475-497). Either the FE doesn't display that offer, or non-legacy yearly users universally decline; backend can't tell which.

### Live HTTP request log (Cloud Logging)

`resource.labels.service_name="prod-api"` + `httpRequest.requestUrl:"retention-offer"` returns only **2 hits** going back to 2026-04-13 — both from the same team `ce2cc1ac-…-90dfb` on 2026-05-12. The full flow we observed for that team:

```
23:17:07.710  POST .../subscriptions/cancel/reason             → 200
23:17:07.861  GET  .../subscriptions/cancel/retention-offer    → 200 (649 B → eligible:true, offer attached)
23:17:15.189  POST .../subscriptions/cancel/retention/accept   → 200 (wrote the row)
23:17:32.517  POST .../subscriptions/cancel/reason             → 200 (second submit)
23:17:32.658  GET  .../subscriptions/cancel/retention-offer    → 200 (554 B → likely cooldown_active now)
23:17:32.828  POST .../subscriptions/cancel                    → 200 (Stripe portal URL)
```

This is the canonical happy path, and it also shows the cooldown check kicking in: the same team submitted again 17 s after accepting, got a smaller payload (554 B vs 649 B, consistent with `eligible:false` / `cooldown_active`), and was correctly routed to the Stripe portal cancel. **Backend behaved exactly as designed.**

A `httpRequest.requestUrl:"/cancel"` filter over the same window returned the same single-team, single-day cluster of cancel-flow traffic. There is no "wave" of users hitting the retention endpoints and silently failing; there is a near-total absence of users hitting them at all.
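To check other sessions against the sequence above, exported request logs could be grouped per team and reduced to the furthest retention-funnel stage reached. The sketch below is one hedged way to do that. The `request` and `stage` types, the stage names, and the example paths are hypothetical; only the endpoint suffixes come from the log excerpt.

```go
package main

import (
	"fmt"
	"strings"
)

// request is a hypothetical, minimal view of one exported log entry.
type request struct {
	Method string
	Path   string
}

// stage is the furthest point a session reached in the retention funnel.
type stage int

const (
	stageNone stage = iota
	stageReason
	stageOfferFetched
	stageAccepted
)

var stageNames = [...]string{"none", "reason_submitted", "offer_fetched", "offer_accepted"}

func (s stage) String() string { return stageNames[s] }

// furthestStage scans a session's requests and returns the deepest stage seen.
// A plain POST .../subscriptions/cancel matches no retention stage.
func furthestStage(reqs []request) stage {
	best := stageNone
	for _, r := range reqs {
		var s stage
		switch {
		case r.Method == "POST" && strings.HasSuffix(r.Path, "/cancel/retention/accept"):
			s = stageAccepted
		case r.Method == "GET" && strings.HasSuffix(r.Path, "/cancel/retention-offer"):
			s = stageOfferFetched
		case r.Method == "POST" && strings.HasSuffix(r.Path, "/cancel/reason"):
			s = stageReason
		}
		if s > best {
			best = s
		}
	}
	return best
}

func main() {
	// Hypothetical session: reason submitted, offer fetched, then cancelled.
	session := []request{
		{"POST", "/v3/teams/example/subscriptions/cancel/reason"},
		{"GET", "/v3/teams/example/subscriptions/cancel/retention-offer"},
		{"POST", "/v3/teams/example/subscriptions/cancel"},
	}
	fmt.Println(furthestStage(session))
}
```

Sessions that stop at `reason_submitted` would be the ones where the FE never issued the `GET`, which is the case the backend cannot currently distinguish from a decline.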

### Cloud Run stdout logs

Filter `severity>=ERROR` over last 30 days mentioning "retention" produces only the spam below — no application errors from the handlers themselves:

```
failed to access secret version: rpc error: code = NotFound desc =
  Secret [projects/144082320709/secrets/TOBY_RETENTIONMINSUBSCRIPTIONDAYS] not found or has no versions.
failed to access secret version: rpc error: code = NotFound desc =
  Secret [projects/144082320709/secrets/TOBY_RETENTIONCOOLDOWNMONTHS] not found or has no versions.
failed to access secret version: rpc error: code = NotFound desc =
  Secret [projects/144082320709/secrets/TOBY_RETENTIONLEGACYYEARLYPRICE] not found or has no versions.
failed to access secret version: rpc error: code = NotFound desc =
  Secret [projects/144082320709/secrets/TOBY_RETENTIONCOUPONLEGACY] not found or has no versions.
failed to access secret version: rpc error: code = NotFound desc =
  Secret [projects/144082320709/secrets/TOBY_RETENTIONCOUPONYEARLY] not found or has no versions.
```

`gcloud secrets list --filter=name~TOBY_RETENTION` returns `[]`. None of these secrets exist. Loader (`gcp_processor.go:25-32`) treats missing as `("", false)` → envconfig falls back to struct-tag default (`config.go:188-192`):

| Field | Default used in prod |
|---|---|
| `RetentionCouponYearly` | `"retention_yearly"` |
| `RetentionCouponLegacy` | `"retention_legacy"` |
| `RetentionLegacyYearlyPrice` | `36.00` |
| `RetentionCooldownMonths` | `12` |
| `RetentionMinSubscriptionDays` | `30` |

These defaults match what's actually in the `retention_offers` rows (`coupon_id` is `retention_yearly` / `retention_legacy`), so the system *is* operating correctly — just on hard-coded defaults rather than Secret Manager values.
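The fail-soft behaviour described above can be illustrated with a small sketch. This is not the code in `gcp_processor.go` or `config.go`, and it does not use envconfig; the `secretLookup` type and both helper functions are hypothetical stand-ins that only mimic the "missing secret becomes `("", false)`, then the default applies" pattern.

```go
package main

import (
	"fmt"
	"strconv"
)

// secretLookup mimics a fail-soft secret loader: a missing secret is
// reported as ("", false) rather than as an error. Hypothetical type.
type secretLookup func(name string) (value string, ok bool)

// stringOrDefault returns the secret value, or def when the secret is missing.
func stringOrDefault(lookup secretLookup, name, def string) string {
	if v, ok := lookup(name); ok {
		return v
	}
	return def
}

// intOrDefault returns the secret parsed as an int, or def when the secret
// is missing or is not a valid integer.
func intOrDefault(lookup secretLookup, name string, def int) int {
	v, ok := lookup(name)
	if !ok {
		return def
	}
	n, err := strconv.Atoi(v)
	if err != nil {
		return def
	}
	return n
}

func main() {
	// Models production today: none of the TOBY_RETENTION* secrets exist.
	missing := func(string) (string, bool) { return "", false }

	fmt.Println(stringOrDefault(missing, "TOBY_RETENTIONCOUPONYEARLY", "retention_yearly"))
	fmt.Println(intOrDefault(missing, "TOBY_RETENTIONCOOLDOWNMONTHS", 12))
	fmt.Println(intOrDefault(missing, "TOBY_RETENTIONMINSUBSCRIPTIONDAYS", 30))
}
```

The trade-off of this pattern is visible here: the caller cannot tell "secret intentionally absent" from "secret misconfigured", which is why the only signal in production is the cold-start log line.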

### Cancel handler does not invoke retention

`SubscriptionController.Cancel` (`subscription_context.go:75-151`) goes straight to `PaymentSvc.CancelSubscription(team.PaymentCustomerID, subscription.ProviderID)` (L138) which returns a Stripe billing-portal URL. **No retention check, no offer write, no skip-logged event.** The retention flow is entirely FE-orchestrated; if the FE skips `GET /retention-offer` and goes straight to `POST /cancel`, the user is gone with zero backend signal.

### Eligibility gates (no silent flag found)

`validateRetentionEligibility` (subscription_context.go:261-505) gates on:
1. Subscription must exist + be active (`no_active_subscription`)
2. Must have a `cancellation_reasons` row (`invalid_reason`)
3. If yearly AND `!IsEligibleForRetention(reason)` → `invalid_reason`. **BUT** `IsEligibleForRetention` (cancellation_reason.go:33-37) returns true for any non-empty reason — *all four* enum values (`not_using`, `too_expensive`, `missing_features`, `other`) pass.
4. Subscription age ≥ `RetentionMinSubscriptionDays` (30d default) → else `subscription_too_new`.
5. No prior retention_offers row within `RetentionCooldownMonths` (12mo default) → else `cooldown_active`.
6. Legacy detection via feature flag `cfgBase.legacy2` (subscription_context.go:294-312) — feature-flag failure is treated as "non-legacy" (fail-soft), so this can't be silently blocking legacy users.

Apart from the fail-soft legacy-detection flag in gate 6, there is **no feature flag, A/B gate, or kill-switch hidden in the eligibility code** that's quietly returning `eligible:false`. The only blocking gates are subscription state, missing reason, subscription age, and cooldown, all of which are deterministic and visible.

---

## Root-cause hypothesis (high confidence)

**Two causes, both non-backend:**

1. **Most cancels bypass our FE flow entirely.** The 120-vs-22 gap (~98 cancels in 30d without a cancellation_reasons row) is best explained by users cancelling through Stripe Billing Portal directly — from invoice emails, billing.stripe.com links, or after `POST /cancel` redirects them to the portal and they confirm there. Stripe webhooks (`customer.subscription.updated`) then flip `subscriptions.status` to `canceled`. The FE retention modal never has a chance to run. **Fix is product/FE — not backend** (intercept cancel intent earlier, before the Stripe-portal redirect).

2. **Of users who do submit a reason, the FE either isn't surfacing the offer or users decline at ~95%.** Backend has no telemetry to disambiguate. The log evidence from team `ce2cc1ac` on 2026-05-12 shows the FE *does* call `GET /retention-offer` and *does* call `POST /retention/accept` when the user accepts, but that is the only such sequence in the logs searched (back to 2026-04-13). Most "post-reason" sessions appear to never reach the GET. **Confirmation needed from FE**: see [Defer to frontend](#defer-to-frontend) below.

## Schema gap: issued vs accepted

The most important schema fact for this ticket:

> **`retention_offers` only records ACCEPTS.** There is no row for "offer was displayed and the user declined", no row for "offer was eligible but the FE never rendered it", and no row when `GET /retention-offer` returns `eligible:true`.

To answer the business question "how many offers are being shown vs accepted?", we need one of:

- **(Backend) Log "offer eligible" events**: emit a structured log line at `subscription_context.go:599-607` whenever `result.Eligible == true`. Cheap, no schema change.
- **(Backend) Add `retention_offer_views` table or `status` column**: write a row at `GetRetentionOffer` time with `status='offered'`, update to `status='accepted'` on accept. More work, but proper analytics.
- **(FE) Mixpanel/Segment event** when the offer UI is displayed, declined, or accepted. Probably the right surface — the FE already knows when it renders the offer, and this is a product-funnel question.

Until one of these exists, **the ticket's metric "offers issued" cannot be measured at all from `retention_offers` — full stop.**
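As an illustration of the first option (a structured "offer eligible" log line), the sketch below uses the standard-library `log/slog` package (Go 1.21+). It is an assumption that the service could adopt `slog`; the `offerResult` type, the `eligibleAttrs` helper, the event name, and the field keys are all hypothetical, loosely following the fields named in the recommended actions (`teamID`, `userID`, `OfferType`, `CouponID`).

```go
package main

import (
	"context"
	"log/slog"
	"os"
)

// offerResult is a hypothetical stand-in for the eligibility result.
type offerResult struct {
	Eligible  bool
	OfferType string
	CouponID  string
}

// eligibleAttrs builds the log attributes for an eligible offer.
// It returns (nil, false) when the offer is not eligible, so callers
// only log the "offer eligible" case.
func eligibleAttrs(teamID, userID string, r offerResult) ([]slog.Attr, bool) {
	if !r.Eligible {
		return nil, false
	}
	return []slog.Attr{
		slog.String("team_id", teamID),
		slog.String("user_id", userID),
		slog.String("offer_type", r.OfferType),
		slog.String("coupon_id", r.CouponID),
	}, true
}

func main() {
	// JSON output so the line is queryable as structured fields in log tooling.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Placeholder IDs and offer type, for illustration only.
	r := offerResult{Eligible: true, OfferType: "yearly_discount", CouponID: "retention_yearly"}
	if attrs, ok := eligibleAttrs("team-example", "user-example", r); ok {
		logger.LogAttrs(context.Background(), slog.LevelInfo, "retention_offer_eligible", attrs...)
	}
}
```

Counting these lines against `retention_offers` rows would give an "eligible vs accepted" ratio, though it still would not show whether the FE actually rendered the offer.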

---

## Defer to frontend

Backend has done its part — the write path is healthy, eligibility is permissive, and the cooldown/age gates are working correctly when they fire. To close TOBY-6 we need the FE doctor to answer:

1. **Funnel A**: For the ~98 cancels in 30d that never POST `/cancel/reason`, are they hitting the FE cancel modal at all, or are they cancelling via Stripe portal directly? Does our marketing/billing-email flow even funnel users through our modal first?
2. **Funnel B**: For the 21 users in 30d who POSTed `/cancel/reason` but did NOT POST `/retention/accept`, did the FE then call `GET /retention-offer`? Did it receive `eligible:true`? Was the offer UI rendered? Did the user click "no thanks"?
3. **Non-legacy yearly users**: of the cancellation_reasons in 30d, how many were on non-legacy yearly subs? Were any of them shown the `retention_yearly` discount-on-renewal offer (subscription_context.go:475-497) and did they decline universally?

---

## Recommended actions

| # | Owner | Action |
|---|---|---|
| 1 | FE doctor | Answer the three funnel questions above. Confirm whether the modal pipeline calls `GET /retention-offer` after `POST /cancel/reason` in 100% of sessions, and what the FE does with `eligible:true`. |
| 2 | Backend (this agent / Toby team) | **Add a structured log line** in `GetRetentionOffer` when `result.Eligible == true`, including `teamID`, `userID`, `OfferType`, `CouponID`. Adds one point of funnel visibility without a schema change. ~10 LOC patch in subscription_context.go:599-607. |
| 3 | Backend / DevOps | Separate ticket: create the 5 missing `TOBY_RETENTION*` secrets in Secret Manager (or rip out the secret-lookup for these and document the defaults). Either way, kill the cold-start log spam — it's misleading on-call signal. |
| 4 | Product | Decide whether to add a `retention_offers.status` column (or a sibling `retention_offer_views` table) so "issued" is measurable. Until then, the ticket's metric definition is unanswerable from this schema. |

---

## Open items / unknowns

- **Why 134 cancellation_reasons in Feb 2026 but only 24 in Apr 2026?** The reason-submit rate has fallen ~5.6× in two months, and the last 30 days show only 22. Either active subscriber churn is genuinely falling, OR the FE flow that captures the reason is increasingly being skipped. Worth a separate look. (Backend evidence is healthy in both periods; no deploys to `apps/api` since 2026-02-02.)
- **Stripe portal direct cancels — confirmed?** I'm inferring this from the 120-vs-22 gap. Confirming requires either Stripe Sigma access or correlating the `subscriptions.updated_at` with whether a matching `cancellation_reasons` row exists in a ~5min window. Worth doing but not strictly required to defer this to FE.
