Snowplow is a state-of-the-art data collection platform that advocates for self-describing events. This means the events (rows) that travel through Snowplow's pipeline have a highly nested, hierarchical structure:
{
  "event": "user_transaction",
  "time": 1730592005,
  "APP_ID": "Match-Masters",
  "EVENT_VENDOR": "io.candivore",
  "COLLECTOR_TSTAMP": "2024-11-03T00:00:08",
  "CONTEXTS_COM_IAB_SNOWPLOW_SPIDERS_AND_ROBOTS_1": [
    {
      "_schema_version": "1-0-0",
      "category": "SPIDER_OR_ROBOT",
      "primaryImpact": "UNKNOWN",
      "reason": "FAILED_UA_INCLUDE",
      "spiderOrRobot": true
    }
  ],
  "CONTEXTS_COM_SNOWPLOWANALYTICS_SNOWPLOW_UA_PARSER_CONTEXT_1": [
    {
      "_schema_version": "1-0-0",
      "deviceFamily": "Other"
      // etc...
    }
  ],
  "CONTEXTS_IO_CANDIVORE_USER_BASE_STATS_1": [
    {
      "LT_matches_played_cnt": 14422,
      "LT_matches_won_cnt": 6766,
      "LT_purchases_amt": 1
      // etc...
    }
  ],
  "CONTEXTS_NL_BASJES_YAUAA_CONTEXT_1": [
    {
      "_schema_version": "1-0-4",
      "agentClass": "Hacker"
      // etc...
    }
  ],
  "UNSTRUCT_EVENT_IO_CANDIVORE_USER_TRANSACTION_1": {
    "is_coin_paid": false,
    "is_received_resource": false
    // etc...
  },
  "VALUE": {
    "LT_matches_played_cnt": 14422
    // etc...
  }
}
This is a good thing for data collection; this is a challenge for data analysis.
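To make the challenge concrete, here is a hypothetical sketch of what reading just two values from a raw row takes (the table name snowplow_events is an assumption; the column and field names come from the sample event above):

```sql
SELECT
    APP_ID,
    -- context columns are arrays holding one object:
    -- index into the array, drill in with path notation, then cast
    CONTEXTS_NL_BASJES_YAUAA_CONTEXT_1[0]:agentClass::string AS agent_class,
    -- unstruct event columns are single objects: drill in, then cast
    UNSTRUCT_EVENT_IO_CANDIVORE_USER_TRANSACTION_1:is_coin_paid::boolean AS is_coin_paid
FROM snowplow_events;  -- assumed table name
```

Every query has to repeat this indexing, drilling, and casting, which is exactly the friction described below.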
When Snowplow loads data into Snowflake (via RDB Loader), it uses Snowflake's generic OBJECT and ARRAY types, which (importantly) do not schematize nested fields:
CREATE OR REPLACE TABLE example_table (id INT, DATA OBJECT);

-- Snowflake's VALUES clause does not accept semi-structured constructors,
-- so we insert via SELECT ... UNION ALL instead
INSERT INTO example_table (id, DATA)
SELECT 1, OBJECT_CONSTRUCT('name', 'Alice', 'age', 30, 'skills', ARRAY_CONSTRUCT('SQL', 'Python'))
UNION ALL
SELECT 2, OBJECT_CONSTRUCT('name', 'Bob', 'age', 25, 'skills', ARRAY_CONSTRUCT('Java', 'Scala'))
UNION ALL
SELECT 3, OBJECT_CONSTRUCT('name', 'Carol', 'age', 35, 'skills', ARRAY_CONSTRUCT('R', 'Tableau'));
So when you query the table:
SELECT *
FROM example_table
WHERE DATA:age > 30;
yields a semi-structured result:
id | data |
---|---|
3 | {"name": "Carol", "age": 35, "skills": ["R", "Tableau"]} |
Snowflake can flatten arrays using the FLATTEN table function:
SELECT id,
       skill.value::string AS skill
FROM example_table,
     LATERAL FLATTEN(INPUT => DATA:skills) AS skill;
yields:
id | skill |
---|---|
1 | SQL |
1 | Python |
2 | Java |
2 | Scala |
3 | R |
3 | Tableau |
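The same pattern works when the array holds objects rather than scalars, which is exactly the shape of the CONTEXTS_* columns in the Snowplow event above: each flattened row exposes its object as value, and fields come out with path notation. A sketch against the sample event (the table name snowplow_events is an assumption):

```sql
SELECT APP_ID,
       ctx.value:category::string       AS category,
       ctx.value:spiderOrRobot::boolean AS spider_or_robot
FROM snowplow_events,  -- assumed table name
     LATERAL FLATTEN(INPUT => CONTEXTS_COM_IAB_SNOWPLOW_SPIDERS_AND_ROBOTS_1) AS ctx;
```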
But most of the time, we just want to use the : operator (path notation for object field access) to project a simple relational table:
SELECT id,
DATA:name AS name,
DATA:age AS age
FROM example_table;
yields:
id | name | age |
---|---|---|
1 | Alice | 30 |
2 | Bob | 25 |
3 | Carol | 35 |
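One caveat worth flagging: the : operator returns VARIANT values, so string fields render with JSON quotes unless you cast them. A sketch of the same query with explicit :: casts, which is usually what a downstream tool expects:

```sql
SELECT id,
       DATA:name::string AS name,  -- without the cast, names render JSON-quoted, e.g. "Alice"
       DATA:age::int     AS age
FROM example_table;
```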
When Mixpanel loads data from Snowflake (via warehouse connectors), it assumes the table is denormalized and that the important analysis dimensions are top-level columns.
Mixpanel does support Object and Array types, however with limited governance (semantic layer) support.
In the current state, most of the valuable data is not highly accessible in Mixpanel.
Mixpanel can unnest values at query time using custom properties (Amit made a great video about it).
We analyzed your schema and created ~500 custom properties for all of the nested fields in your semi-structured snowplow data.
This should unblock your POC and get the team activated on self serve!
If Candivore adopts Mixpanel as its self-serve event analytics tool of choice, some form of column management upstream of Mixpanel will be necessary to ensure the important analysis dimensions land as top-level columns.
There are many approaches to flattening STRUCTs; the simplest is a view:
CREATE OR REPLACE VIEW mixpanel_events_view AS
SELECT
    -- ensure all events have a user id
    COALESCE(user_id, UNSTRUCT_EVENT_IO_CANDIVORE_USER_BASE_STATS_1:user_id_nested) AS user_id,
    -- denormalize nested values
    UNSTRUCT_EVENT_IO_CANDIVORE_POP_UP_1:trigger AS pop_up_trigger,
    -- include any other columns
    other_column1,
    other_column2
FROM your_source_table;
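Applied to the Snowplow columns from the sample event above, the same view pattern handles the array-wrapped contexts too: index into the single-element array, then use path notation, then cast. (Column names come from the sample event; the source table name is a placeholder.)

```sql
CREATE OR REPLACE VIEW snowplow_contexts_view AS
SELECT
    -- contexts are single-element arrays: index [0], then path notation, then cast
    CONTEXTS_IO_CANDIVORE_USER_BASE_STATS_1[0]:LT_matches_played_cnt::int AS lt_matches_played_cnt,
    CONTEXTS_IO_CANDIVORE_USER_BASE_STATS_1[0]:LT_matches_won_cnt::int    AS lt_matches_won_cnt,
    CONTEXTS_NL_BASJES_YAUAA_CONTEXT_1[0]:agentClass::string              AS agent_class
FROM your_source_table;
```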
This "wrangling" of Snowplow contexts will be necessary for any analytics tool the team chooses to adopt.