Created
May 16, 2023 20:55
-
-
Save rmoff/67bcca1fdaf8f1b887578eba1a318446 to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"id": "e8e13cd9", | |
"metadata": {}, | |
"source": [ | |
"<img src=\"https://projectnessie.org/img/nessie.svg\" alt=\"lakeFS logo\" width=200/> " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "78797e22", | |
"metadata": {}, | |
"source": [ | |
"## Write-Audit-Publish (WAP) pattern with Nessie" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "0f89a51c", | |
"metadata": {}, | |
"source": [ | |
"**New to Write-Audit-Publish? This [talk](https://www.youtube.com/watch?v=fXHdeBnpXrg&t=1001s) explains it well.**\n", | |
"\n", | |
"[@rmoff](https://twitter.com/rmoff/) " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "7d979dfe", | |
"metadata": {}, | |
"source": [ | |
"# Setup & Initialisation" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"id": "45b87dc4", | |
"metadata": { | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Requirement already satisfied: pynessie==0.30.0 in /opt/conda/lib/python3.10/site-packages (0.30.0)\n", | |
"Requirement already satisfied: findspark in /opt/conda/lib/python3.10/site-packages (2.0.1)\n", | |
"Requirement already satisfied: attrs in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (22.2.0)\n", | |
"Requirement already satisfied: python-dateutil in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (2.8.2)\n", | |
"Requirement already satisfied: marshmallow in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (3.19.0)\n", | |
"Requirement already satisfied: confuse==1.7.0 in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (1.7.0)\n", | |
"Requirement already satisfied: Click<9.0.0,>6.0.0 in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (8.1.3)\n", | |
"Requirement already satisfied: requests-aws4auth in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (1.2.3)\n", | |
"Requirement already satisfied: botocore in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (1.29.131)\n", | |
"Requirement already satisfied: requests in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (2.28.2)\n", | |
"Requirement already satisfied: simplejson in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (3.19.1)\n", | |
"Requirement already satisfied: marshmallow-oneofschema in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (3.0.1)\n", | |
"Requirement already satisfied: desert in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (2022.9.22)\n", | |
"Requirement already satisfied: pyyaml in /opt/conda/lib/python3.10/site-packages (from confuse==1.7.0->pynessie==0.30.0) (6.0)\n", | |
"Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /opt/conda/lib/python3.10/site-packages (from botocore->pynessie==0.30.0) (1.0.1)\n", | |
"Requirement already satisfied: urllib3<1.27,>=1.25.4 in /opt/conda/lib/python3.10/site-packages (from botocore->pynessie==0.30.0) (1.26.15)\n", | |
"Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil->pynessie==0.30.0) (1.16.0)\n", | |
"Requirement already satisfied: typing-inspect in /opt/conda/lib/python3.10/site-packages (from desert->pynessie==0.30.0) (0.8.0)\n", | |
"Requirement already satisfied: packaging>=17.0 in /opt/conda/lib/python3.10/site-packages (from marshmallow->pynessie==0.30.0) (23.0)\n", | |
"Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests->pynessie==0.30.0) (3.1.0)\n", | |
"Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests->pynessie==0.30.0) (2022.12.7)\n", | |
"Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests->pynessie==0.30.0) (3.4)\n", | |
"Requirement already satisfied: mypy-extensions>=0.3.0 in /opt/conda/lib/python3.10/site-packages (from typing-inspect->desert->pynessie==0.30.0) (1.0.0)\n", | |
"Requirement already satisfied: typing-extensions>=3.7.4 in /opt/conda/lib/python3.10/site-packages (from typing-inspect->desert->pynessie==0.30.0) (4.5.0)\n" | |
] | |
} | |
], | |
"source": [ | |
"import sys\n", | |
"!{sys.executable} -m pip install pynessie==0.30.0 findspark" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "4e4e906a", | |
"metadata": {}, | |
"source": [ | |
"## Set up Spark " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"id": "fbbb1ab3", | |
"metadata": { | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Spark Running\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/html": [ | |
"\n", | |
" <div>\n", | |
" <p><b>SparkSession - in-memory</b></p>\n", | |
" \n", | |
" <div>\n", | |
" <p><b>SparkContext</b></p>\n", | |
"\n", | |
" <p><a href=\"http://e3ed1f564f29:4040\">Spark UI</a></p>\n", | |
"\n", | |
" <dl>\n", | |
" <dt>Version</dt>\n", | |
" <dd><code>v3.3.2</code></dd>\n", | |
" <dt>Master</dt>\n", | |
" <dd><code>local[*]</code></dd>\n", | |
" <dt>AppName</dt>\n", | |
" <dd><code>pyspark-shell</code></dd>\n", | |
" </dl>\n", | |
" </div>\n", | |
" \n", | |
" </div>\n", | |
" " | |
], | |
"text/plain": [ | |
"<pyspark.sql.session.SparkSession at 0xffff76410a30>" | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"import os\n", | |
"from pyspark.sql import *\n", | |
"from pyspark import SparkConf\n", | |
"import pynessie\n", | |
"\n", | |
"# import findspark\n", | |
"# findspark.init()\n", | |
"# pynessie_version = pynessie.__version__\n", | |
"\n", | |
"conf = SparkConf()\n", | |
"# we need iceberg libraries and the nessie sql extensions\n", | |
"conf.set(\n", | |
" \"spark.jars.packages\",\"org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.0,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.3_2.12:0.58.1\"\n", | |
")\n", | |
"# ensure python <-> java interactions are w/ pyarrow\n", | |
"conf.set(\"spark.sql.execution.pyarrow.enabled\", \"true\")\n", | |
"# create catalog named arctic as an iceberg catalog\n", | |
"conf.set(\"spark.sql.catalog.rmoff\", \"org.apache.iceberg.spark.SparkCatalog\")\n", | |
"\n", | |
"# tell the catalog that its a Nessie catalog\n", | |
"conf.set(\"spark.sql.catalog.rmoff.catalog-impl\", \"org.apache.iceberg.nessie.NessieCatalog\")\n", | |
"\n", | |
"# set the location for the catalog to store data. Spark writes to this directory\n", | |
"conf.set(\"spark.sql.catalog.rmoff.warehouse\", \"file://\" + os.getcwd() + \"/spark_warehouse/iceberg\")\n", | |
"\n", | |
"# set the location of the Arctic/Nessie server.\n", | |
"conf.set(\"spark.sql.catalog.rmoff.uri\", \"http://nessie:19120/api/v1\")\n", | |
"\n", | |
"# default branch for Arctic catalog to work on\n", | |
"conf.set(\"spark.sql.catalog.rmoff.ref\", \"main\")\n", | |
"\n", | |
"\n", | |
"# enable the extensions for both Nessie and Iceberg\n", | |
"conf.set(\n", | |
" \"spark.sql.extensions\",\n", | |
" \"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions\",\n", | |
")\n", | |
"\n", | |
"# finally, start up the Spark server\n", | |
"spark = SparkSession.builder.config(conf=conf).getOrCreate()\n", | |
"print(\"Spark Running\")\n", | |
"spark" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "eceeaa9b", | |
"metadata": {}, | |
"source": [ | |
"## Load test data" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"id": "ad496308", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"spark.read.option(\"inferSchema\",\"true\").option(\"multiline\",\"true\").json(\"/home/jovyan/data/nyc_film_permits.json\").createOrReplaceTempView(\"permits_src\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"id": "8827299b", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table>\n", | |
" <tr>\n", | |
" <th>borough</th>\n", | |
" <th>permit_cnt</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Queens</td>\n", | |
" <td>168</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Brooklyn</td>\n", | |
" <td>334</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Staten Island</td>\n", | |
" <td>7</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Manhattan</td>\n", | |
" <td>463</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Bronx</td>\n", | |
" <td>28</td>\n", | |
" </tr>\n", | |
"</table>" | |
], | |
"text/plain": [ | |
"+---------------+------------+\n", | |
"| borough | permit_cnt |\n", | |
"+---------------+------------+\n", | |
"| Queens | 168 |\n", | |
"| Brooklyn | 334 |\n", | |
"| Staten Island | 7 |\n", | |
"| Manhattan | 463 |\n", | |
"| Bronx | 28 |\n", | |
"+---------------+------------+" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"%%sql\n", | |
"SELECT borough, count(*) permit_cnt\n", | |
"FROM permits_src\n", | |
"GROUP BY borough" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "3114eec3", | |
"metadata": {}, | |
"source": [ | |
"## Write test data to Iceberg files " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"id": "3143e7e1", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table>\n", | |
" <tr>\n", | |
" </tr>\n", | |
"</table>" | |
], | |
"text/plain": [ | |
"++\n", | |
"||\n", | |
"++\n", | |
"++" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"%%sql \n", | |
"\n", | |
"CREATE TABLE rmoff.permits USING ICEBERG\n", | |
"AS SELECT * FROM permits_src" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "a0bc6fd3", | |
"metadata": {}, | |
"source": [ | |
"#### Inspect Iceberg metadata" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"id": "df9f12ae", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table>\n", | |
" <tr>\n", | |
" <th>content</th>\n", | |
" <th>file_path</th>\n", | |
" <th>file_format</th>\n", | |
" <th>spec_id</th>\n", | |
" <th>record_count</th>\n", | |
" <th>file_size_in_bytes</th>\n", | |
" <th>column_sizes</th>\n", | |
" <th>value_counts</th>\n", | |
" <th>null_value_counts</th>\n", | |
" <th>nan_value_counts</th>\n", | |
" <th>lower_bounds</th>\n", | |
" <th>upper_bounds</th>\n", | |
" <th>key_metadata</th>\n", | |
" <th>split_offsets</th>\n", | |
" <th>equality_ids</th>\n", | |
" <th>sort_order_id</th>\n", | |
" <th>readable_metrics</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>0</td>\n", | |
" <td>file:/home/jovyan/work/spark_warehouse/iceberg/permits_30f4184f-81ed-45c3-b21c-2e93bc1a5d1e/data/00000-3-bdcc692d-fb2f-47ab-a7d5-b9e205989800-00001.parquet</td>\n", | |
" <td>PARQUET</td>\n", | |
" <td>0</td>\n", | |
" <td>1000</td>\n", | |
" <td>51115</td>\n", | |
" <td>{1: 483, 2: 474, 3: 1183, 4: 119, 5: 2736, 6: 5023, 7: 142, 8: 2348, 9: 343, 10: 26704, 11: 1487, 12: 2462, 13: 745, 14: 2358}</td>\n", | |
" <td>{1: 1000, 2: 1000, 3: 1000, 4: 1000, 5: 1000, 6: 1000, 7: 1000, 8: 1000, 9: 1000, 10: 1000, 11: 1000, 12: 1000, 13: 1000, 14: 1000}</td>\n", | |
" <td>{1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0}</td>\n", | |
" <td>{}</td>\n", | |
" <td>{1: bytearray(b'Bronx'), 2: bytearray(b'Commercial'), 3: bytearray(b'0, 2, 3'), 4: bytearray(b'United States of'), 5: bytearray(b'2022-11-04T22:00'), 6: bytearray(b'2022-11-02T13:34'), 7: bytearray(b"Mayor\\'s Office o"), 8: bytearray(b'678909'), 9: bytearray(b'DCAS Prep/Shoot/'), 10: bytearray(b'1 AVENUE between'), 11: bytearray(b'0, 10'), 12: bytearray(b'2022-11-03T00:00'), 13: bytearray(b'Cable-episodic'), 14: bytearray(b'0, 10011')}</td>\n", | |
" <td>{1: bytearray(b'Staten Island'), 2: bytearray(b'WEB'), 3: bytearray(b'9'), 4: bytearray(b'United States og'), 5: bytearray(b'2023-02-20T18:01'), 6: bytearray(b'2023-01-18T14:35'), 7: bytearray(b"Mayor\\'s Office p"), 8: bytearray(b'691875'), 9: bytearray(b'Theater Load in!'), 10: bytearray(b'WYTHE AVENUE beu'), 11: bytearray(b'94'), 12: bytearray(b'2023-01-20T13:01'), 13: bytearray(b'Variety'), 14: bytearray(b'11693, 11694')}</td>\n", | |
" <td>None</td>\n", | |
" <td>[4]</td>\n", | |
" <td>None</td>\n", | |
" <td>0</td>\n", | |
" <td>Row(borough=Row(column_size=483, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='Bronx', upper_bound='Staten Island'), category=Row(column_size=474, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='Commercial', upper_bound='WEB'), communityboard_s=Row(column_size=1183, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='0, 2, 3', upper_bound='9'), country=Row(column_size=119, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='United States of', upper_bound='United States og'), enddatetime=Row(column_size=2736, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='2022-11-04T22:00', upper_bound='2023-02-20T18:01'), enteredon=Row(column_size=5023, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='2022-11-02T13:34', upper_bound='2023-01-18T14:35'), eventagency=Row(column_size=142, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound="Mayor's Office o", upper_bound="Mayor's Office p"), eventid=Row(column_size=2348, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='678909', upper_bound='691875'), eventtype=Row(column_size=343, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='DCAS Prep/Shoot/', upper_bound='Theater Load in!'), parkingheld=Row(column_size=26704, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='1 AVENUE between', upper_bound='WYTHE AVENUE beu'), policeprecinct_s=Row(column_size=1487, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='0, 10', upper_bound='94'), startdatetime=Row(column_size=2462, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='2022-11-03T00:00', upper_bound='2023-01-20T13:01'), subcategoryname=Row(column_size=745, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='Cable-episodic', upper_bound='Variety'), zipcode_s=Row(column_size=2358, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='0, 10011', upper_bound='11693, 11694'))</td>\n", | |
" </tr>\n", | |
"</table>" | |
], | |
"text/plain": [ | |
"+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+---------+--------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+---------------+--------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", | |
"| content | file_path | file_format | spec_id | record_count | file_size_in_bytes | column_sizes | value_counts | null_value_counts | nan_value_counts | lower_bounds | upper_bounds | key_metadata | split_offsets | equality_ids | sort_order_id | readable_metrics |\n", | |
"+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+---------+--------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+---------------+--------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", | |
"| 0 | file:/home/jovyan/work/spark_warehouse/iceberg/permits_30f4184f-81ed-45c3-b21c-2e93bc1a5d1e/data/00000-3-bdcc692d-fb2f-47ab-a7d5-b9e205989800-00001.parquet | PARQUET | 0 | 1000 | 51115 | {1: 483, 2: 474, 3: 1183, 4: 119, 5: 2736, 6: 5023, 7: 142, 8: 2348, 9: 343, 10: 26704, 11: 1487, 12: 2462, 13: 745, 14: 2358} | {1: 1000, 2: 1000, 3: 1000, 4: 1000, 5: 1000, 6: 1000, 7: 1000, 8: 1000, 9: 1000, 10: 1000, 11: 1000, 12: 1000, 13: 1000, 14: 1000} | {1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0} | {} | {1: bytearray(b'Bronx'), 2: bytearray(b'Commercial'), 3: bytearray(b'0, 2, 3'), 4: bytearray(b'United States of'), 5: bytearray(b'2022-11-04T22:00'), 6: bytearray(b'2022-11-02T13:34'), 7: bytearray(b\"Mayor\\'s Office o\"), 8: bytearray(b'678909'), 9: bytearray(b'DCAS Prep/Shoot/'), 10: bytearray(b'1 AVENUE between'), 11: bytearray(b'0, 10'), 12: bytearray(b'2022-11-03T00:00'), 13: bytearray(b'Cable-episodic'), 14: bytearray(b'0, 10011')} | {1: bytearray(b'Staten Island'), 2: bytearray(b'WEB'), 3: bytearray(b'9'), 4: bytearray(b'United States og'), 5: bytearray(b'2023-02-20T18:01'), 6: bytearray(b'2023-01-18T14:35'), 7: bytearray(b\"Mayor\\'s Office p\"), 8: bytearray(b'691875'), 9: bytearray(b'Theater Load in!'), 10: bytearray(b'WYTHE AVENUE beu'), 11: bytearray(b'94'), 12: bytearray(b'2023-01-20T13:01'), 13: bytearray(b'Variety'), 14: bytearray(b'11693, 11694')} | None | [4] | None | 0 | Row(borough=Row(column_size=483, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='Bronx', upper_bound='Staten Island'), category=Row(column_size=474, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='Commercial', upper_bound='WEB'), communityboard_s=Row(column_size=1183, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='0, 2, 3', upper_bound='9'), country=Row(column_size=119, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='United States of', upper_bound='United States og'), enddatetime=Row(column_size=2736, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='2022-11-04T22:00', upper_bound='2023-02-20T18:01'), enteredon=Row(column_size=5023, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='2022-11-02T13:34', upper_bound='2023-01-18T14:35'), eventagency=Row(column_size=142, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound=\"Mayor's Office o\", upper_bound=\"Mayor's Office p\"), eventid=Row(column_size=2348, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='678909', upper_bound='691875'), eventtype=Row(column_size=343, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='DCAS Prep/Shoot/', upper_bound='Theater Load in!'), parkingheld=Row(column_size=26704, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='1 AVENUE between', upper_bound='WYTHE AVENUE beu'), policeprecinct_s=Row(column_size=1487, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='0, 10', upper_bound='94'), startdatetime=Row(column_size=2462, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='2022-11-03T00:00', upper_bound='2023-01-20T13:01'), subcategoryname=Row(column_size=745, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='Cable-episodic', upper_bound='Variety'), zipcode_s=Row(column_size=2358, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='0, 10011', upper_bound='11693, 11694')) |\n", | |
"+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+---------+--------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+---------------+--------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"%sql SELECT * FROM rmoff.permits.files" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"id": "4a493930", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table>\n", | |
" <tr>\n", | |
" <th>made_current_at</th>\n", | |
" <th>snapshot_id</th>\n", | |
" <th>parent_id</th>\n", | |
" <th>is_current_ancestor</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>2023-05-16 20:53:32.769000</td>\n", | |
" <td>776903278721003441</td>\n", | |
" <td>None</td>\n", | |
" <td>True</td>\n", | |
" </tr>\n", | |
"</table>" | |
], | |
"text/plain": [ | |
"+----------------------------+--------------------+-----------+---------------------+\n", | |
"| made_current_at | snapshot_id | parent_id | is_current_ancestor |\n", | |
"+----------------------------+--------------------+-----------+---------------------+\n", | |
"| 2023-05-16 20:53:32.769000 | 776903278721003441 | None | True |\n", | |
"+----------------------------+--------------------+-----------+---------------------+" | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"%sql SELECT * FROM rmoff.permits.history" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"id": "13f312d0", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table>\n", | |
" <tr>\n", | |
" <th>committed_at</th>\n", | |
" <th>snapshot_id</th>\n", | |
" <th>parent_id</th>\n", | |
" <th>operation</th>\n", | |
" <th>manifest_list</th>\n", | |
" <th>summary</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>2023-05-16 20:53:32.769000</td>\n", | |
" <td>776903278721003441</td>\n", | |
" <td>None</td>\n", | |
" <td>append</td>\n", | |
" <td>file:/home/jovyan/work/spark_warehouse/iceberg/permits_30f4184f-81ed-45c3-b21c-2e93bc1a5d1e/metadata/snap-776903278721003441-1-fdaa71ad-87db-4ad5-8862-463e3246dd9e.avro</td>\n", | |
" <td>{'spark.app.id': 'local-1684270382136', 'changed-partition-count': '1', 'added-data-files': '1', 'total-equality-deletes': '0', 'added-records': '1000', 'total-position-deletes': '0', 'added-files-size': '51115', 'total-delete-files': '0', 'total-files-size': '51115', 'total-records': '1000', 'total-data-files': '1'}</td>\n", | |
" </tr>\n", | |
"</table>" | |
], | |
"text/plain": [ | |
"+----------------------------+--------------------+-----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", | |
"| committed_at | snapshot_id | parent_id | operation | manifest_list | summary |\n", | |
"+----------------------------+--------------------+-----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", | |
"| 2023-05-16 20:53:32.769000 | 776903278721003441 | None | append | file:/home/jovyan/work/spark_warehouse/iceberg/permits_30f4184f-81ed-45c3-b21c-2e93bc1a5d1e/metadata/snap-776903278721003441-1-fdaa71ad-87db-4ad5-8862-463e3246dd9e.avro | {'spark.app.id': 'local-1684270382136', 'changed-partition-count': '1', 'added-data-files': '1', 'total-equality-deletes': '0', 'added-records': '1000', 'total-position-deletes': '0', 'added-files-size': '51115', 'total-delete-files': '0', 'total-files-size': '51115', 'total-records': '1000', 'total-data-files': '1'} |\n", | |
"+----------------------------+--------------------+-----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+" | |
] | |
}, | |
"execution_count": 10, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"%sql SELECT * FROM rmoff.permits.snapshots" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "dd3c5e94", | |
"metadata": {}, | |
"source": [ | |
"# The Setup" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "c8ac1043", | |
"metadata": {}, | |
"source": [ | |
"## Create Nessie branch " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"id": "6eb07626", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"branch='etl_job_42'" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"id": "8351cf07", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table>\n", | |
" <tr>\n", | |
" <th>refType</th>\n", | |
" <th>name</th>\n", | |
" <th>hash</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Branch</td>\n", | |
" <td>etl_job_42</td>\n", | |
" <td>d7b2235a73fc5d6ea59e6239236649d953c769ba845f4f6b535c2e5abba81b8d</td>\n", | |
" </tr>\n", | |
"</table>" | |
], | |
"text/plain": [ | |
"+---------+------------+------------------------------------------------------------------+\n", | |
"| refType | name | hash |\n", | |
"+---------+------------+------------------------------------------------------------------+\n", | |
"| Branch | etl_job_42 | d7b2235a73fc5d6ea59e6239236649d953c769ba845f4f6b535c2e5abba81b8d |\n", | |
"+---------+------------+------------------------------------------------------------------+" | |
] | |
}, | |
"execution_count": 12, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"%sql CREATE BRANCH {branch} IN rmoff FROM main" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "3beacc72", | |
"metadata": {}, | |
"source": [ | |
"### Use the new branch for reading and writing" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "6f43a4af", | |
"metadata": {}, | |
"source": [ | |
"#### Now change the `REFERENCE`" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"id": "3d230c6c", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table>\n", | |
" <tr>\n", | |
" <th>refType</th>\n", | |
" <th>name</th>\n", | |
" <th>hash</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Branch</td>\n", | |
" <td>etl_job_42</td>\n", | |
" <td>d7b2235a73fc5d6ea59e6239236649d953c769ba845f4f6b535c2e5abba81b8d</td>\n", | |
" </tr>\n", | |
"</table>" | |
], | |
"text/plain": [ | |
"+---------+------------+------------------------------------------------------------------+\n", | |
"| refType | name | hash |\n", | |
"+---------+------------+------------------------------------------------------------------+\n", | |
"| Branch | etl_job_42 | d7b2235a73fc5d6ea59e6239236649d953c769ba845f4f6b535c2e5abba81b8d |\n", | |
"+---------+------------+------------------------------------------------------------------+" | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"%sql USE REFERENCE {branch} IN rmoff" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "f5ab667f", | |
"metadata": {}, | |
"source": [ | |
"### Show list of references in Nessie" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"id": "a6ae4b95", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table>\n", | |
" <tr>\n", | |
" <th>refType</th>\n", | |
" <th>name</th>\n", | |
" <th>hash</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Branch</td>\n", | |
" <td>etl_job_42</td>\n", | |
" <td>d7b2235a73fc5d6ea59e6239236649d953c769ba845f4f6b535c2e5abba81b8d</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Branch</td>\n", | |
" <td>main</td>\n", | |
" <td>d7b2235a73fc5d6ea59e6239236649d953c769ba845f4f6b535c2e5abba81b8d</td>\n", | |
" </tr>\n", | |
"</table>" | |
], | |
"text/plain": [ | |
"+---------+------------+------------------------------------------------------------------+\n", | |
"| refType | name | hash |\n", | |
"+---------+------------+------------------------------------------------------------------+\n", | |
"| Branch | etl_job_42 | d7b2235a73fc5d6ea59e6239236649d953c769ba845f4f6b535c2e5abba81b8d |\n", | |
"| Branch | main | d7b2235a73fc5d6ea59e6239236649d953c769ba845f4f6b535c2e5abba81b8d |\n", | |
"+---------+------------+------------------------------------------------------------------+" | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"%sql LIST REFERENCES IN rmoff" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "886ac446", | |
"metadata": {}, | |
"source": [ | |
"### Check that we still see the same data" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"id": "bfab197f", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table>\n", | |
" <tr>\n", | |
" <th>borough</th>\n", | |
" <th>permit_cnt</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Queens</td>\n", | |
" <td>168</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Bronx</td>\n", | |
" <td>28</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Manhattan</td>\n", | |
" <td>463</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Brooklyn</td>\n", | |
" <td>334</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Staten Island</td>\n", | |
" <td>7</td>\n", | |
" </tr>\n", | |
"</table>" | |
], | |
"text/plain": [ | |
"+---------------+------------+\n", | |
"| borough | permit_cnt |\n", | |
"+---------------+------------+\n", | |
"| Queens | 168 |\n", | |
"| Bronx | 28 |\n", | |
"| Manhattan | 463 |\n", | |
"| Brooklyn | 334 |\n", | |
"| Staten Island | 7 |\n", | |
"+---------------+------------+" | |
] | |
}, | |
"execution_count": 15, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"%sql SELECT borough, count(*) permit_cnt FROM rmoff.permits GROUP BY borough" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "2ddc4ff7", | |
"metadata": {}, | |
"source": [ | |
"# Write" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "d88056be", | |
"metadata": {}, | |
"source": [ | |
"Update the dataframe to remove rows matching predicate. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"id": "82d21f90", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table>\n", | |
" <tr>\n", | |
" </tr>\n", | |
"</table>" | |
], | |
"text/plain": [ | |
"++\n", | |
"||\n", | |
"++\n", | |
"++" | |
] | |
}, | |
"execution_count": 16, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"%sql DELETE FROM rmoff.permits WHERE borough='Manhattan'" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "3f6c3a04", | |
"metadata": {}, | |
"source": [ | |
"## Inspecting the staged/unpublished data" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "9c09de33", | |
"metadata": {}, | |
"source": [ | |
"### Staged/unpublished data" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "d73e5ba8", | |
"metadata": {}, | |
"source": [ | |
"#### The changes are reflected in the table:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"id": "cabb818c", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table>\n", | |
" <tr>\n", | |
" <th>borough</th>\n", | |
" <th>permit_cnt</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Queens</td>\n", | |
" <td>168</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Bronx</td>\n", | |
" <td>28</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Brooklyn</td>\n", | |
" <td>334</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Staten Island</td>\n", | |
" <td>7</td>\n", | |
" </tr>\n", | |
"</table>" | |
], | |
"text/plain": [ | |
"+---------------+------------+\n", | |
"| borough | permit_cnt |\n", | |
"+---------------+------------+\n", | |
"| Queens | 168 |\n", | |
"| Bronx | 28 |\n", | |
"| Brooklyn | 334 |\n", | |
"| Staten Island | 7 |\n", | |
"+---------------+------------+" | |
] | |
}, | |
"execution_count": 17, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"%sql SELECT borough, count(*) permit_cnt FROM rmoff.permits GROUP BY borough" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "60e26a26", | |
"metadata": {}, | |
"source": [ | |
"### Published data" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "9e5a4eb7", | |
"metadata": {}, | |
"source": [ | |
"The data on the `main` branch remains unchanged. We can validate this by running a query against the data, specifying `main` as the branch using the `@<branch>` suffix:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"id": "6fefa576", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table>\n", | |
" <tr>\n", | |
" <th>borough</th>\n", | |
" <th>permit_cnt</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Queens</td>\n", | |
" <td>168</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Bronx</td>\n", | |
" <td>28</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Manhattan</td>\n", | |
" <td>463</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Brooklyn</td>\n", | |
" <td>334</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Staten Island</td>\n", | |
" <td>7</td>\n", | |
" </tr>\n", | |
"</table>" | |
], | |
"text/plain": [ | |
"+---------------+------------+\n", | |
"| borough | permit_cnt |\n", | |
"+---------------+------------+\n", | |
"| Queens | 168 |\n", | |
"| Bronx | 28 |\n", | |
"| Manhattan | 463 |\n", | |
"| Brooklyn | 334 |\n", | |
"| Staten Island | 7 |\n", | |
"+---------------+------------+" | |
] | |
}, | |
"execution_count": 18, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"%sql SELECT borough, count(*) permit_cnt FROM rmoff.`permits@main` GROUP BY borough" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "a58244fc", | |
"metadata": {}, | |
"source": [ | |
"# Audit " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "590d8ce7", | |
"metadata": {}, | |
"source": [ | |
"At the moment the data is written to the audit branch (`etl_job_42`), but not published to `main`. \n", | |
"\n", | |
"How you audit the data is up to you. The nice thing about the data being staged is that you can do it within the same ETL job, or have another tool do it.\n", | |
"\n", | |
"Here's a very simple example of doing in Python. We're going to programatically check that only the four expected boroughs remain in the data.\n", | |
"\n", | |
"First, we define those that are expected:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"id": "62d6b827", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"expected_boroughs = {\"Queens\", \"Brooklyn\", \"Bronx\", \"Staten Island\"}" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "3c05b9a5", | |
"metadata": {}, | |
"source": [ | |
"Then we get a set of the actual boroughs in the staged data" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"id": "77abc608", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"distinct_boroughs = spark.sql(\"SELECT DISTINCT borough FROM rmoff.permits\").toLocalIterator()\n", | |
"boroughs = {row[0] for row in distinct_boroughs}" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "81156579", | |
"metadata": {}, | |
"source": [ | |
"Now we do two checks:\n", | |
"\n", | |
"1. Compare the length of the expected vs actual set\n", | |
"2. Check that the two sets when unioned are still the same length. This is necessary, since the first test isn't sufficient alone" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"id": "c91668d4", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Audit has passed 🙌🏻\n" | |
] | |
} | |
], | |
"source": [ | |
"if ( (len(boroughs) != len(expected_boroughs)) \\\n", | |
" or (len(boroughs) != len(set.union(boroughs, expected_boroughs))) \\\n", | |
" or (len(expected_boroughs) != len(set.union(boroughs, expected_boroughs)))):\n", | |
" raise ValueError(f\"Audit failed, borough set does not match expected boroughs: {boroughs} != {expected_boroughs}\")\n", | |
"else:\n", | |
" print(f\"Audit has passed 🙌🏻\")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "d19526bb", | |
"metadata": {}, | |
"source": [ | |
"# Publish" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "6dc07949", | |
"metadata": {}, | |
"source": [ | |
"Publishing data in Nessie means merging the audit branch back into `main`, making it available to anyone working with the data in that branch." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"id": "dd0375e9", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table>\n", | |
" <tr>\n", | |
" <th>name</th>\n", | |
" <th>hash</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>main</td>\n", | |
" <td>48601a4f7a54d2d3d985e72da3f9760cc2b8db9e72c98a2004d9fd9671d30bc5</td>\n", | |
" </tr>\n", | |
"</table>" | |
], | |
"text/plain": [ | |
"+------+------------------------------------------------------------------+\n", | |
"| name | hash |\n", | |
"+------+------------------------------------------------------------------+\n", | |
"| main | 48601a4f7a54d2d3d985e72da3f9760cc2b8db9e72c98a2004d9fd9671d30bc5 |\n", | |
"+------+------------------------------------------------------------------+" | |
] | |
}, | |
"execution_count": 22, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"%sql MERGE BRANCH {branch} INTO main IN rmoff" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "d38524dd", | |
"metadata": {}, | |
"source": [ | |
"## Inspecting the published data" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"id": "0c3dbf77", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table>\n", | |
" <tr>\n", | |
" <th>borough</th>\n", | |
" <th>permit_cnt</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Queens</td>\n", | |
" <td>168</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Bronx</td>\n", | |
" <td>28</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Brooklyn</td>\n", | |
" <td>334</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Staten Island</td>\n", | |
" <td>7</td>\n", | |
" </tr>\n", | |
"</table>" | |
], | |
"text/plain": [ | |
"+---------------+------------+\n", | |
"| borough | permit_cnt |\n", | |
"+---------------+------------+\n", | |
"| Queens | 168 |\n", | |
"| Bronx | 28 |\n", | |
"| Brooklyn | 334 |\n", | |
"| Staten Island | 7 |\n", | |
"+---------------+------------+" | |
] | |
}, | |
"execution_count": 25, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"%sql SELECT borough, count(*) permit_cnt FROM rmoff.`permits@main` GROUP BY borough" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "37af8102", | |
"metadata": {}, | |
"source": [ | |
"You can also change the REFERENCE context back to `main` and query the table directly" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"id": "6b4fef6a", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table>\n", | |
" <tr>\n", | |
" <th>refType</th>\n", | |
" <th>name</th>\n", | |
" <th>hash</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Branch</td>\n", | |
" <td>main</td>\n", | |
" <td>48601a4f7a54d2d3d985e72da3f9760cc2b8db9e72c98a2004d9fd9671d30bc5</td>\n", | |
" </tr>\n", | |
"</table>" | |
], | |
"text/plain": [ | |
"+---------+------+------------------------------------------------------------------+\n", | |
"| refType | name | hash |\n", | |
"+---------+------+------------------------------------------------------------------+\n", | |
"| Branch | main | 48601a4f7a54d2d3d985e72da3f9760cc2b8db9e72c98a2004d9fd9671d30bc5 |\n", | |
"+---------+------+------------------------------------------------------------------+" | |
] | |
}, | |
"execution_count": 23, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"%sql USE REFERENCE main IN rmoff" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"id": "a735db09", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table>\n", | |
" <tr>\n", | |
" <th>borough</th>\n", | |
" <th>permit_cnt</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Queens</td>\n", | |
" <td>168</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Bronx</td>\n", | |
" <td>28</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Brooklyn</td>\n", | |
" <td>334</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>Staten Island</td>\n", | |
" <td>7</td>\n", | |
" </tr>\n", | |
"</table>" | |
], | |
"text/plain": [ | |
"+---------------+------------+\n", | |
"| borough | permit_cnt |\n", | |
"+---------------+------------+\n", | |
"| Queens | 168 |\n", | |
"| Bronx | 28 |\n", | |
"| Brooklyn | 334 |\n", | |
"| Staten Island | 7 |\n", | |
"+---------------+------------+" | |
] | |
}, | |
"execution_count": 24, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"%sql SELECT borough, count(*) permit_cnt FROM rmoff.permits GROUP BY borough" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "7482b2db", | |
"metadata": {}, | |
"source": [ | |
"# Where Next?" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "814d43a5", | |
"metadata": {}, | |
"source": [ | |
"* For more information about write-audit-publish see [this talk from Michelle Winters](https://www.youtube.com/watch?v=fXHdeBnpXrg&t=1001s) and [this talk from Sam Redai](https://www.dremio.com/wp-content/uploads/2022/05/Sam-Redai-The-Write-Audit-Publish-Pattern-via-Apache-Iceberg.pdf)." | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3 (ipykernel)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.10.10" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment