Skip to content

Instantly share code, notes, and snippets.

@trevorh
Last active April 27, 2026 23:19
Show Gist options
  • Select an option

  • Save trevorh/41ecc004fd3db7037904bfd489f20c9e to your computer and use it in GitHub Desktop.

Select an option

Save trevorh/41ecc004fd3db7037904bfd489f20c9e to your computer and use it in GitHub Desktop.
Family Perspective PRD v3 — commerce-foundation-pro agreement model, full 7-layer pipeline, state machine recommendation

Family Perspective: Product Requirements Document (v3)

Date: 2026-04-27 Author: Trevor Hart Status: Draft Supersedes: family-perspective-prd-v2.md (2026-04-27), family-perspective-prd.md (v1, 2026-04-24) Audience: CDI leadership, Subscriber Analytics Engineering, CDW team, Commerce Foundation team, analytics stakeholders

Key changes from v2: Expanded scope from the base subscriber dimension to the full unified pipeline family (state transitions, growth movements, growth accounting, LMA, aggregate balance, metrics). Added assessment of the NRT timing mismatch for supplementary fields. Added state machine recommendation for movement classification. Resolved remaining question about ad_state/tv_state derivability.


1. Problem Statement

Unchanged from v2. Disney streaming analytics operates across four services (Disney+, Hulu, ESPN+, Star+) but has no single table where an analyst can answer the question: "What does this subscriber look like across all their streaming services?"

What exists today

Perspective Table Grain What it shows What it doesn't show
Service Perspective SERVICE_SUBSCRIBER_DIM_VW (83 consumers) subscriber x service x date One row per subscriber per streaming service per day. Full attribute detail per service. No cross-service view. Must self-join across 4 service rows.
Unified Perspective UNIFIED_SUBSCRIBER_DIM_VW (46 consumers) subscriber x date (deduplicated) One row per subscriber per day, cross-service. Picks a "primary" subscription. Only includes subscribers with a Disney+ subscription. Primary selection loses non-primary detail.
Conformed Composite (wide engagement table) subscriber x date Combines subscriber state with engagement metrics. Too wide for subscription-focused analysis. Mixes concerns.
Unified v2 (in QA) fact_unified_v2_subscribers_vw (48 cols) subscriber x date All subscribers across services. Family-level flags. Built on dim_subscribers_scd via 4-way self-join. 10 UAT bugs, 4 unresolved after 9 months. "Multiple Values" pattern. 4-day backfill.

The gap

Three specific audiences are blocked. See v2 Section 1 for the full question mapping.

Why this matters now

  1. ESPN Flagship launch — every hardcoded h/e/d/s pattern must be extended.
  2. Commerce-foundation-pro agreement model is production-ready — hourly refresh, all subscribers including unactivated.
  3. Unified_v2 UAT stuck for 9 months — root cause is the self-join architecture.

2. Data Architecture Landscape

Unchanged from v2. Three independent pipelines from DATOS:

  1. commerce-foundation-pro (intended source of truth, hourly, all subscribers)
  2. subscriber-fuze (legacy parallel, daily, identity tables cover activated only)
  3. dim_subscribers_scd (current foundation, daily, all subscribers, full history)

See v2 Section 2 for full details including the identity_id coverage gap (70M+ unactivated, 15M actively paying) and the NULL is_disney flag finding.


3. The Full Unified Pipeline Family

The "Family Perspective" is not just a base subscriber dimension. The current unified pipeline is a 7-layer stack, and the Family Perspective must eventually provide equivalents for the full stack:

Current unified pipeline layers

Layer 1: dim_unified_subscribers_scd (5.58B rows)
    Foundation: subscriber x date SCD with previous_* columns
    |
Layer 2: fact_unified_subscriber_state_transitions
    Grain: subscriber x transition_date
    Detection: SCD row boundaries (each new SCD row = a transition)
    Output: 53 change-detection flags (current vs previous for each attribute)
    |
Layer 3: fact_unified_subscriber_growth_movements (5.77B rows)
    Grain: subscriber x transition_date
    Classification: JOIN to dim_subscriber_transition_movement_mapping
    Output: 46 binary movement flags + LMA exclusion flags
    Movement categories: 60+ types across signups, conversions, cancels,
    pauses, product switches, service changes, at-risk transitions
    |
Layer 4: fact_unified_subscriber_dimension_movements (3.97B rows)
    Grain: subscriber x date
    Tracks: every dimensional attribute change with upgrade/downgrade flags
    |
Layer 5: fact_unified_subscriber_growth_accounting (92.9M rows)
    Grain: date x dimensions (NO subscriber_sk_id)
    Aggregation: movements + LMA adjustments
    |
Layer 6: fact_unified_subscriber_aggregate_balance (2.80B rows)
    Grain: date x dimensions
    Calculation: starting balance + movements = ending balance
    |
Layer 7: fact_unified_subscriber_metrics (2.66B rows, 100+ KPIs)
    Grain: date x dimensions
    Output: final aggregated metrics for BI/Looker/executive reporting

Daily execution chain

07:00 UTC  dim_unified_subscribers_daily_load (Layer 1)
08:00 UTC  fact_unified_subscriber_facts_load (Layers 2-4)
  parallel:  dimension_movements_agg_load
  parallel:  growth_accounting_load (Layer 5)
  parallel:  aggregate_balance_load (Layer 6)
08:00 UTC  fact_unified_subscriber_metrics_load (Layer 7)
09:00 UTC  fact_unified_v2_subscribers_daily_load (the v2 base table)

Service-level equivalents

Each unified layer has a service-level counterpart:

Unified Service-Level
dim_unified_subscribers_scd dim_subscribers_scd (per streaming_service)
fact_unified_subscriber_growth_movements fact_service_subscriber_growth_movements
fact_unified_subscriber_growth_movements_lma fact_service_subscriber_growth_movements_lma
fact_unified_subscriber_growth_accounting fact_service_subscriber_growth_accounting
fact_unified_subscriber_dimension_movements fact_service_subscriber_dimension_movements
fact_unified_subscriber_aggregate_balance fact_service_subscriber_aggregate_balance
fact_unified_subscriber_metrics fact_service_subscriber_metrics

4. Requirements

4.1 Functional Requirements (expanded from v2)

Phase 1 (MVP): Base Subscriber Dimension

R1-R7 unchanged from v2. Summary:

  • R1: Cross-service subscriber state (paid/entitled/trial across all services)
  • R2: Cross-service product attributes (preserve actual values, not "Multiple Values")
  • R3: Service-level drill-down (filter, not helper tables)
  • R4: Bundle and multi-subscription classification
  • R5: All subscribers included (including 70M+ unactivated 3P partners)
  • R6: Identity linkage (identity_id where available, agreement_id as surrogate)
  • R7: Service extensibility (new services = new rows, not new columns)

Phase 2: State Transitions and Growth Movements

"When this subscriber's state changed, what happened?"

  • R8: Family-level state transitions — detect day-over-day changes in the subscriber's cross-service state. When a subscriber goes from "paid on Disney+ and Hulu" to "paid on Disney+ only", that's a family-level transition.
  • R9: Growth movement classification — classify each transition with a movement category (signup, cancel, pause, upgrade, downgrade, service change, etc.) using mapping-table-driven logic, not hardcoded CASE WHEN blocks.
  • R10: LMA (Last Mile Adjustments) — apply partnership-specific exclusion rules for financial reporting accuracy.

Phase 3: Growth Accounting, Balance, and Metrics

"What are the aggregate KPIs for the family perspective?"

  • R11: Growth accounting — aggregated movement counts by date and dimensional attributes, with LMA adjustments applied.
  • R12: Aggregate balance — running subscriber counts (starting + adds - stops = ending) at the family level.
  • R13: Metric store — 100+ pre-aggregated KPIs for Looker and executive dashboards.

4.2 Non-Functional Requirements (expanded from v2)

NR1-NR5 unchanged from v2. Added:

NR6: Movement classification maintainability

  • Movement categories must be defined in mapping tables, not SQL CASE WHEN blocks.
  • Adding a new movement category (e.g., for a new product type) should require only a DML INSERT to the mapping table, not a code change.
  • Mapping tables must be version-controlled as seed DML in the repo, not managed as production-only data.

NR7: State machine formalization

  • Valid subscriber states and valid transitions between them should be defined as a formal state machine stored in a reference table.
  • Any observed transition not in the state machine should be flagged as "unexpected" rather than silently falling into a catch-all "Other" bucket.
  • The state machine table serves as both runtime logic and documentation.

5. Recommended Architecture

Phase 1: Base Subscriber Dimension

Updated from v2 to address the NRT timing mismatch for supplementary fields.

DATOS raw events
    |
commerce-foundation-pro (Databricks, hourly)
    |
    +-- agreement_product_dim (agreement x product SCD, Delta + Iceberg)
    +-- agreement_dim (agreement SCD, carries identity_id)
    |
    |   Iceberg auto-refresh to Snowflake
    v
commerce-core-ds-pro (Snowflake)
    |
    +-- agreement_product_dim Iceberg table (auto-refreshed)
    +-- agreement_dim Iceberg table (auto-refreshed)
    |
    +-- family_subscriber_summary_vw (NEW VIEW)
    |     Grain: subscriber_sk_id x business_date
    |     Source: agreement tables (primary) + dim_subscribers_scd (supplementary)
    |
    +-- Looker explores join to agreement_product_dim for service drill-down

Supplementary fields: the NRT timing problem and solution

Four fields (subscriber_ad_state, subscriber_tv_state, subscriber_home_country, is_activated) are not on the agreement tables and currently require a join to dim_subscribers_scd. However, the agreement tables refresh hourly while dim_subscribers_scd is daily batch — creating a window where new subscribers appear in the agreement data but have no SCD match.

Resolution: hybrid derivation with COALESCE

For each field, derive an NRT approximation from the agreement data and fall back to the SCD value when available:

Field NRT Approximation (from agreement data) Batch Exact (from SCD) How
subscriber_ad_state Derive from offer.labels array (has_ads label) dim_subscribers_scd.subscriber_ad_state Requires upstream change: add offer_labels extraction to commerce-foundation-pro's agreement_product_read_optimized_stage. The ESPN Unlimited pipeline in the same repo already extracts offer.labels — same pattern. ~99% accuracy for post-unification subscriptions.
subscriber_tv_state Derive from offer.labels array (live_tv, flagship labels) dim_subscribers_scd.subscriber_tv_state Same upstream change. ~90% accuracy — gap is legacy Hulu live TV subs pre-ACP, which is a shrinking population within the 2025+ MVP scope.
subscriber_home_country Use storefront_country from agreement_product_dim dim_subscribers_scd.subscriber_home_country Already available on agreement tables. ~85-95% match vs 6-source priority resolution.
is_activated identity_id IS NOT NULL on agreement_dim dim_subscribers_scd.is_activated Reasonable proxy for activation.
-- Pattern for each field:
COALESCE(
    scd.subscriber_ad_state,                         -- exact, available after daily batch
    derived_ad_state_from_offer_labels               -- approximate, available within 1 hour
) AS subscriber_ad_state

New subscribers get the approximation immediately; it upgrades to the exact value within 24 hours. Analysts see complete data at all times.

Upstream prerequisite: Commerce-foundation-pro team adds offer.labels extraction to the agreement_product_read_optimized_stage application config. This is a single-line addition to the column map — the field exists in the DATOS payload and the extraction pattern is already used in the ESPN Unlimited pipeline in the same repo.

Phase 2: State Transitions and Growth Movements

agreement_product_dim (SCD with effective dates)
    |
    +-- Each new SCD row = a product-level state transition
    |   Previous state: prior SCD row's product_state
    |   Current state: current SCD row's product_state
    |
    +-- dim_product_state_movement_mapping (NEW MAPPING TABLE)
    |     Key: (prev_product_state, curr_product_state)
    |     Output: movement_category, movement_description
    |     ~70 rows covering all valid state pairs
    |     Version-controlled as seed DML in the repo
    |
    +-- family_state_machine (NEW REFERENCE TABLE)
    |     Defines: valid family-level states, valid transitions, labels
    |     States: combinations of per-product states across services
    |     Transitions: edges in the state graph
    |     Unexpected transitions flagged rather than silently bucketed
    |
    v
family_subscriber_transition_fact_vw (NEW VIEW)
    Grain: subscriber_sk_id x transition_date
    Movement classification via mapping table JOIN (not CASE WHEN)
    |
    v
family_subscriber_growth_movement_fact_vw (NEW VIEW)
    Adds: LMA exclusion flags, movement direction

Why mapping tables and state machines instead of CASE WHEN

The current unified pipeline has 214 CASE WHEN branches spread across its views:

Component Branches Purpose
State transitions (service) 52 Mechanical change detection
State transitions (unified) 53 Same + cross-service
Identity entitlement movements 64 State-pair to movement (hardcoded, has catch-all "Other")
DTCI terminal state mapping 23 State collapsing
LMA exclusions ~10 Partner-specific rules
LMA DML 12 Partnership minimum guarantees

The identity entitlement movements view was built iteratively (7 commits, multiple "fix dup issue" messages) — new state pairs added as they appeared in production data. The movement mapping table (dim_subscriber_transition_movement_mapping) is the one well-designed component: 46 binary flags keyed by (prev_state, curr_state, last_entitled_over_one_year_ago, has_most_recent_entitled_date_flag). However, it is production-managed and not version-controlled — changes require DML against the production table.

The recommended approach for Family Perspective:

  1. Mapping table for movement classificationdim_product_state_movement_mapping with ~70 rows. Changes = INSERT/UPDATE on the table, not code changes. Version- controlled as seed DML.

  2. Formal state machine for family states — defines all valid (prev_state, curr_state) pairs with transition labels. Any observed transition not in the table is flagged as "unexpected" rather than falling into "Other". The state machine table is both runtime logic and documentation.

  3. Event-sourced input — the agreement model is already SCD-based with effective dates. Each new SCD row IS a transition. No day-over-day snapshot diffing needed.

This pattern is established at scale: Stripe uses formal state machines for billing lifecycle; dbt Labs' subscriber metrics methodology recommends state-graph-driven transition classification.

Phase 3: Growth Accounting, Balance, and Metrics

These layers are pure aggregation — they GROUP BY date and dimensional attributes, computing sums of movement counts and running balances. The architecture is identical to the current unified pipeline; only the input changes (family movement facts instead of unified movement facts).

The LMA (Last Mile Adjustment) logic is the most complex component: 3,554 lines of partnership-specific exclusion rules. This must be reviewed separately to determine whether the same rules apply at the family level or need adaptation. LMA rules are currently hardcoded in SQL and maintained via a cloned production table.


6. Gap Analysis: Full Pipeline Stack

What commerce-foundation-pro provides vs. what each layer needs

Layer Currently reads from Commerce-foundation-pro provides Gap
L1: Base subscriber dim dim_subscribers_scd (4-way self-join) agreement_product_dim (product rows, GROUP BY) Covered. Supplementary fields via COALESCE pattern.
L2: State transitions SCD row boundaries agreement SCD has effective dates Covered. SCD rows define transitions natively.
L3: Growth movements mapping table + transitions Events have state_src/state_scd Covered with new mapping table. 64-branch CASE WHEN replaced by ~70-row lookup.
L4: Dimension movements SCD attribute diffs agreement SCD has product/offer attributes Partially covered. 40+ dimensional attributes tracked currently; agreement model has rich product metadata but different column names. Needs column mapping.
L5: Growth accounting movements + LMA Movements (from L3) + LMA rules Gap: LMA rules. 3,554 lines of partnership-specific logic. Needs review for family-level applicability.
L6: Aggregate balance growth accounting rollup Pure aggregation Covered once L5 is built.
L7: Metrics balance + facts Pure aggregation Covered once L6 is built.

Key gap: LMA (Last Mile Adjustments)

LMA rules are partnership-specific exclusions applied to growth movements before they feed into growth accounting. Examples:

  • 3PP aggregate partnership free trial caps
  • Disney partnership minimum guarantee billing platform rules
  • Date-gated partner-specific exclusions (Movistar, Vodafone, Docomo, Astro, Hotstar)

These rules are maintained as:

  1. A cloned production table (dim_subscriber_transition_movement_mapping)
  2. ~10 inline CASE WHEN branches in the unified transitions view
  3. 3,554 lines of dedicated LMA DML

For the Family Perspective, LMA rules need to be:

  1. Reviewed for family-level applicability (some may only apply at service level)
  2. Migrated to version-controlled mapping tables
  3. Applied to the family movement facts

This is the single largest complexity item beyond the base table and should be scoped as a separate workstream.


7. Phased Delivery Plan

Phase Scope Dependencies Effort
Phase 1 Base subscriber dimension (family_subscriber_summary_vw) Iceberg tables for agreement_product_dim + agreement_dim; offer_labels extraction; Alan repoints views Medium (6-9 days for view, 8-12 for Alan's repoint)
Phase 2a State transitions + growth movements Phase 1; product state mapping table; state machine definition Medium (mapping table design + view SQL)
Phase 2b Dimension movements Phase 1; column mapping from agreement model to current 40+ attributes Medium
Phase 3a Growth accounting (without LMA) Phase 2a; aggregation views Low (pure GROUP BY aggregation)
Phase 3b LMA rules migration Phase 3a; LMA rule review with partnership team High (3,554 lines of partnership logic)
Phase 4 Aggregate balance + metrics Phase 3a/3b; running balance calculation Medium
Phase 5 Publish to subscriber360 + Looker integration Phase 1 minimum; Phase 4 for full stack Medium

MVP: Phase 1

The base subscriber dimension covers the immediate analytics need and unblocks all three stakeholder audiences identified in the problem statement. Phases 2-5 build the full growth accounting and metrics stack but are not required for the initial launch.


8. Resolved and Remaining Questions

Resolved (since v2)

# Question Resolution
1 Can ad_state be derived from the agreement model? Yes, with upstream change. offer.labels in DATOS contains has_ads. Commerce-foundation-pro needs to add label extraction to the agreement pipeline (single-line config change; pattern already exists in ESPN Unlimited pipeline). ~99% accuracy for post-unification subscriptions.
2 Can tv_state be derived from the agreement model? Mostly yes. offer.labels contains live_tv and flagship. Same upstream change. ~90% accuracy — gap is legacy Hulu live TV pre-ACP (shrinking, within 2025+ scope).
3 NRT timing mismatch for supplementary fields? Solved with COALESCE pattern. NRT approximations from agreement data; exact values fill in from dim_subscribers_scd within 24 hours.
4 What's the scope beyond the base table? 7-layer pipeline. State transitions, growth movements, growth accounting (with LMA), aggregate balance, metrics. Phases 2-5 in the delivery plan.
5 Is there a plan to carry ad_state/tv_state/home_country/activation natively? Architectural blockers for native computation. ad_state and tv_state require multi-subscription aggregation + SKU metadata + Hulu MVP data. home_country requires 6-source priority resolution. is_activated requires account-level activation dates from a separate DATOS domain. The COALESCE pattern is the right long-term approach — NRT approximation + batch exact.

Remaining

# Question Who Impact
1 Will commerce-foundation-pro add offer.labels extraction to the agreement pipeline? Commerce Foundation team Prerequisite for NRT ad_state/tv_state derivation
2 How should the partner field be derived from agreement_source_partner in agreement_dim? Commerce Foundation team / CDW Determines service letter derivation
3 Will Snowflake Iceberg tables be created for agreement_product_dim and agreement_dim? Commerce Foundation team Prerequisite for Snowflake consumption
4 Do LMA rules apply at the family level or only service level? Partnership team / CDW Determines Phase 3b scope
5 What Looker dashboards would use the family perspective? Subscriber AE (Kate C.), Bundle Analytics (Jeff L.) Determines view vs materialized table
6 Are family-level growth movements in scope for MVP or Phase 2? CDI leadership Scopes initial deliverable

Appendix A: Source Documents

Document Description
Family Perspective - Design Requirements Field specs, scenarios, sample queries (2025)
Unified v2 Validation Tracker UAT issues, project plan
Unified_v2 Test Results QE test cases
Family Perspective Slides Design overview presentation
subscriber360 Domain Narrative MMM-151 workspace documentation (13 clusters, 514 views)
identity_id vs subscriber_sk_id Findings Data investigation: 70M+ unactivated subscribers, subscription_id 1:1 mapping

Appendix B: PRD Evolution

Version Date Key changes
v1 2026-04-24 Initial PRD. Foundation: subscriber-fuze identity_agreement tables. Scope: base table only.
v2 2026-04-27 Corrected foundation to commerce-foundation-pro. Added identity_id coverage gap analysis. Scoped to 2025+ for MVP.
v3 2026-04-27 Expanded to full 7-layer pipeline. Added state machine recommendation. Resolved NRT timing mismatch. Added phased delivery plan.

Appendix C: Companion Technical Artifacts

Artifact Purpose Status
family-perspective-rebase-proposal.md v1 technical proposal (subscriber-fuze based) Superseded by this PRD; design options still valid for reference
family-perspective-rebase-visual.html v1 visual architecture diagrams Needs update for commerce-foundation-pro + full pipeline architecture
family-perspective-sql-implementation.md v1 SQL implementation Needs update: source changes, mapping table pattern, COALESCE pattern

Appendix D: Current Pipeline Complexity (for reference)

CASE WHEN branch count across unified pipeline

Component Branches Purpose
State transitions (service-level) 52 Mechanical change detection (prev vs current)
State transitions (unified) 53 Same + cross-service states
Identity entitlement movements 64 State-pair to movement mapping (hardcoded, catch-all "Other")
DTCI terminal state mapping 23 Terminal state collapsing
Unified transitions LMA exclusion ~10 Partner-specific exclusions
LMA DML 12 Partnership minimum guarantees
Total ~214

Key reference tables

Table Key Output Managed
dim_subscriber_transition_movement_mapping (prev_state, curr_state, last_entitled_1y, has_mre_flag) 46 binary movement flags Production-cloned (not version-controlled)
dim_subscriber_transition_breakout_mapping (is_country_change, is_product_switch, is_service_change, is_paid_to_unpaid) movement_direction (in/out) Production-cloned
dim_subscription_subscriber_state_mapping (partner, subscription_state) subscriber_state, stacking_priority, is_entitled, paid_state Production-cloned

Appendix E: Data Pipeline Details

commerce-foundation-pro agreement pipeline

Component Detail
Streaming stage agreement_product_read_optimized_stage — 30-second trigger from S3 DATOS
SCD build agreement_offer_phases_scd -> agreement_dim -> agreement_product_dim
Schedule Hourly (0 * * * *) via commerce_foundation_agreement_product_dim_dag
Storage Databricks Delta with Iceberg universal format enabled
Existing Snowflake consumer cdp-pro: audience_unified_nrt_agreement_product_dim_iceberg

Current unified pipeline execution

Time (UTC) DAG Layer
07:00 commerce_core_dim_unified_subscribers_daily_load L1: Base SCD
08:00 commerce_core_fact_unified_subscriber_facts_load L2-L4: Transitions, movements, dimension movements
08:00 commerce_core_fact_unified_subscriber_growth_accounting_load L5: Growth accounting
08:00 commerce_core_fact_unified_subscriber_aggregate_balance_load L6: Aggregate balance
08:00 commerce_core_fact_unified_subscriber_metrics_load L7: Metrics
09:00 commerce_core_fact_unified_v2_subscribers_daily_load v2 base table (the one in UAT)

Key table sizes (Snowflake, 2026-04-27)

Table Row Count Storage Layer
dim_unified_subscribers_scd 5.58B L1
fact_unified_subscriber_growth_movements 5.77B L3
fact_unified_subscriber_dimension_movements 3.97B L4
fact_unified_subscriber_aggregate_balance 2.80B L6
fact_unified_subscriber_metrics 2.66B L7
fact_unified_subscriber_growth_accounting 92.9M L5
fact_unified_v2_subscribers 1.77B v2 base
fact_unified_v2_service_helper 3.67B v2 helper
SUBSCRIPTION_HIST (native) 416B 241 TB Legacy
SUBSCRIPTION_HIST_EXT (Delta) 7.4T 736 TB Legacy
DIM_SUBSCRIBERS_SCD 13.6B 4.0 TB Current foundation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment