Snowplow is a state-of-the-art data collection platform that advocates for self-describing events. This means the events (rows) that travel through Snowplow's pipeline have a highly nested, hierarchical structure:
{
  "event": "user_transaction",
  "time": 1730592005,
  "APP_ID": "Match-Masters",
  "EVENT_VENDOR": "io.candivore",
  "COLLECTOR_TSTAMP": "2024-11-03T00:00:08",
  "CONTEXTS_COM_IAB_SNOWPLOW_SPIDERS_AND_ROBOTS_1": [
    {
      "_schema_version": "1-0-0",
      "category": "SPIDER_OR_ROBOT",
      "primaryImpact": "UNKNOWN",
      "reason": "FAILED_UA_INCLUDE",
      "spiderOrRobot": true
    }
  ],
  "CONTEXTS_COM_SNOWPLOWANALYTICS_SNOWPLOW_UA_PARSER_CONTEXT_1": [
    {
      "_schema_version": "1-0-0",
      "deviceFamily": "Other"
      // etc...
    }
  ],
  "CONTEXTS_IO_CANDIVORE_USER_BASE_STATS_1": [
    {
      "LT_matches_played_cnt": 14422,
      "LT_matches_won_cnt": 6766,
      "LT_purchases_amt": 1
      // etc...
    }
  ],
  "CONTEXTS_NL_BASJES_YAUAA_CONTEXT_1": [
    {
      "_schema_version": "1-0-4",
      "agentClass": "Hacker"
      // etc...
    }
  ],
  "UNSTRUCT_EVENT_IO_CANDIVORE_USER_TRANSACTION_1": {
    "is_coin_paid": false,
    "is_received_resource": false
    // etc...
  },
  "VALUE": {
    "LT_matches_played_cnt": 14422
    // etc...
  }
}
This is a good thing for data collection; this is a challenge for data analysis.
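To make the challenge concrete, here is a hypothetical sketch of what reading just two values from a raw row takes (the table name snowplow_events is an assumption; the column and field names come from the sample event above):

```sql
SELECT
    APP_ID,
    -- context columns are arrays holding one object:
    -- index into the array, drill in with path notation, then cast
    CONTEXTS_NL_BASJES_YAUAA_CONTEXT_1[0]:agentClass::string AS agent_class,
    -- unstruct event columns are single objects: drill in, then cast
    UNSTRUCT_EVENT_IO_CANDIVORE_USER_TRANSACTION_1:is_coin_paid::boolean AS is_coin_paid
FROM snowplow_events;  -- assumed table name
```

Every query has to repeat this indexing, drilling, and casting, which is exactly the friction described below.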
When Snowplow loads data into Snowflake (via RDB Loader), it uses Snowflake's generic OBJECT and ARRAY types, which (importantly) do not schematize nested fields:
CREATE OR REPLACE TABLE example_table (id INT, DATA OBJECT);

-- Snowflake's VALUES clause does not accept semi-structured constructors,
-- so we insert via SELECT ... UNION ALL instead
INSERT INTO example_table (id, DATA)
SELECT 1, OBJECT_CONSTRUCT('name', 'Alice', 'age', 30, 'skills', ARRAY_CONSTRUCT('SQL', 'Python'))
UNION ALL
SELECT 2, OBJECT_CONSTRUCT('name', 'Bob', 'age', 25, 'skills', ARRAY_CONSTRUCT('Java', 'Scala'))
UNION ALL
SELECT 3, OBJECT_CONSTRUCT('name', 'Carol', 'age', 35, 'skills', ARRAY_CONSTRUCT('R', 'Tableau'));
So when you query the table:
SELECT *
FROM example_table
WHERE DATA:age > 30;
yields a semi-structured result:
id | data |
---|---|
3 | {"name": "Carol", "age": 35, "skills": ["R", "Tableau"]} |
Snowflake can flatten arrays using the FLATTEN table function:
SELECT id,
       skill.value::string AS skill
FROM example_table,
     LATERAL FLATTEN(INPUT => DATA:skills) AS skill;
yields:
id | skill |
---|---|
1 | SQL |
1 | Python |
2 | Java |
2 | Scala |
3 | R |
3 | Tableau |
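The same pattern works when the array holds objects rather than scalars, which is exactly the shape of the CONTEXTS_* columns in the Snowplow event above: each flattened row exposes its object as value, and fields come out with path notation. A sketch against the sample event (the table name snowplow_events is an assumption):

```sql
SELECT APP_ID,
       ctx.value:category::string       AS category,
       ctx.value:spiderOrRobot::boolean AS spider_or_robot
FROM snowplow_events,  -- assumed table name
     LATERAL FLATTEN(INPUT => CONTEXTS_COM_IAB_SNOWPLOW_SPIDERS_AND_ROBOTS_1) AS ctx;
```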
But most of the time, we just want to use the : operator (path notation for object field access) to project a simple relational table:
SELECT id,
DATA:name AS name,
DATA:age AS age
FROM example_table;
yields:
id | name | age |
---|---|---|
1 | Alice | 30 |
2 | Bob | 25 |
3 | Carol | 35 |
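One caveat worth flagging: the : operator returns VARIANT values, so string fields render with JSON quotes unless you cast them. A sketch of the same query with explicit :: casts, which is usually what a downstream tool expects:

```sql
SELECT id,
       DATA:name::string AS name,  -- without the cast, names render JSON-quoted, e.g. "Alice"
       DATA:age::int     AS age
FROM example_table;
```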
When Mixpanel loads data from Snowflake (via warehouse connectors), it assumes the table is denormalized and that the important analysis dimensions are top-level columns.
Mixpanel does support Object and Array types, however with limited governance (semantic layer) support.
In the current state, most of the valuable data is not highly accessible in Mixpanel.
Mixpanel can unnest values at query time using custom properties (Amit made a great video about it).
We analyzed your schema and created ~500 custom properties for all of the nested fields in your semi-structured snowplow data.
This should unblock your POC and get the team activated on self serve!
If Candivore adopts Mixpanel as its self-serve event analytics tool of choice, some form of column management upstream of Mixpanel will be necessary to ensure the important analysis dimensions land as top-level columns.
There are many approaches to flattening STRUCTs; the simplest is a view:
CREATE OR REPLACE VIEW mixpanel_events_view AS
SELECT
    -- ensure all events have a user id
    COALESCE(user_id, UNSTRUCT_EVENT_IO_CANDIVORE_USER_BASE_STATS_1:user_id_nested) AS user_id,
    -- denormalize nested values
    UNSTRUCT_EVENT_IO_CANDIVORE_POP_UP_1:trigger AS pop_up_trigger,
    -- include any other columns
    other_column1,
    other_column2
FROM your_source_table;
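Applied to the Snowplow columns from the sample event above, the same view pattern handles the array-wrapped contexts too: index into the single-element array, then use path notation, then cast. (Column names come from the sample event; the source table name is a placeholder.)

```sql
CREATE OR REPLACE VIEW snowplow_contexts_view AS
SELECT
    -- contexts are single-element arrays: index [0], then path notation, then cast
    CONTEXTS_IO_CANDIVORE_USER_BASE_STATS_1[0]:LT_matches_played_cnt::int AS lt_matches_played_cnt,
    CONTEXTS_IO_CANDIVORE_USER_BASE_STATS_1[0]:LT_matches_won_cnt::int    AS lt_matches_won_cnt,
    CONTEXTS_NL_BASJES_YAUAA_CONTEXT_1[0]:agentClass::string              AS agent_class
FROM your_source_table;
```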
This "wrangling" of Snowplow contexts will be necessary for any analytics tool the team chooses to adopt.