Created: October 8, 2024 22:54
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "view-in-github"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/advincze/725580cc4de3ca330dfcda72de5e2519/workshop-2-0-part1-homework.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "trv6KaKAj_Nm"
},
"source": [
"# Homework\n",
"\n",
"Write a Python script that loads data from the Jikan API (Unofficial MyAnimeList API) into DuckDB using dlt, then complete the tasks.\n",
"\n",
"Use:\n",
"- @dlt.source\n",
"- @dlt.resource\n",
"- @dlt.transformer\n",
"\n",
"Jikan API URL: https://api.jikan.moe/v4\n",
"\n",
"Docs: https://docs.api.jikan.moe/#section/Information\n",
"\n",
"Endpoints for loading:\n",
"- [anime](https://docs.api.jikan.moe/#tag/anime/operation/getAnimeSearch)\n",
"- [characters](https://docs.api.jikan.moe/#tag/characters/operation/getCharactersSearch)\n",
"- [anime/{id}/reviews](https://docs.api.jikan.moe/#tag/anime/operation/getAnimeReviews)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Qn_ODsGIiaTx"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-cFnLAMpIBNF"
},
"source": [
"# Install dlt with the duckdb extension"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "4uVKDZXdGkz3"
},
"outputs": [],
"source": [
"%%capture\n",
"!pip install dlt[duckdb]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZqBbuLiLIGwx"
},
"source": [
"# Play with the API using the requests library"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "fHVNIdclQZzT"
},
"outputs": [],
"source": [
"import requests\n",
"response = requests.get(\"https://api.jikan.moe/v4/anime\")\n",
"# response.json()[\"data\"][0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cMZP5POeIKAW"
},
"source": [
"# Helper\n",
"Run the cell and ignore it."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "nZIBJEKrWpTm"
},
"outputs": [],
"source": [
"from dlt.common.pipeline import LoadInfo\n",
"\n",
"def assert_load_info(info: LoadInfo, expected_load_packages: int = 1) -> None:\n",
"    \"\"\"Asserts that the expected number of packages was loaded and there are no failed jobs\"\"\"\n",
"    assert len(info.loads_ids) == expected_load_packages\n",
"    # all packages loaded\n",
"    assert all(package.state == \"loaded\" for package in info.load_packages) is True\n",
"    # no failed jobs in any of the packages\n",
"    info.raise_on_failed_jobs()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6kM1y10aXf2S"
},
"source": [
"# Task 1\n",
"\n",
"\n",
"- Create a pipeline for the API, for the following endpoints:\n",
"  - anime;\n",
"  - characters.\n",
"\n",
"- Fill in the empty lines in the functions below.\n",
"- Create a [resource](https://dlthub.com/docs/general-usage/resource#declare-a-resource) for the `characters` endpoint from scratch.\n",
"- Group them in a `dlt.source`.\n",
"- [Run the pipeline](https://dlthub.com/docs/walkthroughs/run-a-pipeline) without errors."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "2He2O3XNGqnB"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Pipeline anime_pipeline load step completed in 1.12 seconds\n",
"1 load package(s) were loaded to destination duckdb and into dataset anime_data_20241008105126\n",
"The duckdb destination used duckdb:////Users/adam/src/github.com/advincze/dlt-workshop/anime_pipeline.duckdb location to store data\n",
"Load package 1728427886.70709 is LOADED and contains no failed jobs\n",
"Normalized data for the following tables:\n",
"- _dlt_pipeline_state: 1 row(s)\n",
"- anime: 25 row(s)\n",
"- anime__titles: 113 row(s)\n",
"- anime__producers: 78 row(s)\n",
"- anime__licensors: 31 row(s)\n",
"- anime__studios: 29 row(s)\n",
"- anime__genres: 79 row(s)\n",
"- anime__themes: 40 row(s)\n",
"- anime__title_synonyms: 25 row(s)\n",
"- anime__demographics: 17 row(s)\n",
"- characters: 25 row(s)\n",
"- characters__nicknames: 34 row(s)\n",
"\n",
"Load package 1728427886.70709 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs\n"
]
}
],
"source": [
"import time\n",
"\n",
"import dlt\n",
"import requests\n",
"\n",
"\n",
"@dlt.resource(table_name=\"anime\")\n",
"def get_anime():\n",
"    response = requests.get(\"https://api.jikan.moe/v4/anime\")\n",
"    page = response.json()\n",
"    if \"data\" in page:\n",
"        yield page[\"data\"]\n",
"\n",
"\n",
"# put your code here\n",
"@dlt.resource(table_name=\"characters\")\n",
"def get_characters():\n",
"    response = requests.get(\"https://api.jikan.moe/v4/characters\")\n",
"    page = response.json()\n",
"    if \"data\" in page:\n",
"        yield page[\"data\"]\n",
"\n",
"@dlt.source(name=\"data\")\n",
"def data():\n",
"    return get_anime, get_characters\n",
"\n",
"\n",
"pipeline = dlt.pipeline(\n",
"    pipeline_name='anime_pipeline',\n",
"    destination='duckdb',\n",
"    dataset_name='anime_data',\n",
"    dev_mode=True,\n",
")\n",
"\n",
"load_info = pipeline.run(data())\n",
"print(load_info)\n",
"print(pipeline.last_trace.last_normalize_info)\n",
"assert_load_info(load_info)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JcIaeCd3Yqk9"
},
"source": [
"Run the code below and\n",
"## Answer the Questions:\n",
"**1.1** How many tables were generated by dlt during the normalisation step (see \"Normalized data for the following tables:\")? Including tables created by dlt (e.g. `_dlt_pipeline_state`).\n",
"\n",
"**1.2** What anime has the highest score?"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "y77nnS3qGvmr"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"12\n"
]
}
],
"source": [
"print(len(pipeline.last_trace.last_normalize_info.asdict()[\"load_packages\"][0][\"jobs\"]))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dWU63bAhIEpw"
},
"source": [
"Look at the data:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "MZll2qr_O9wg"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>title_english</th>\n",
"      <th>score</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>Monster</td>\n",
"      <td>8.88</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>Cowboy Bebop</td>\n",
"      <td>8.75</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>One Piece</td>\n",
"      <td>8.72</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>Berserk</td>\n",
"      <td>8.59</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>Neon Genesis Evangelion: The End of Evangelion</td>\n",
"      <td>8.55</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>5</th>\n",
"      <td>Cowboy Bebop: The Movie</td>\n",
"      <td>8.38</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"                                    title_english  score\n",
"0                                         Monster   8.88\n",
"1                                    Cowboy Bebop   8.75\n",
"2                                       One Piece   8.72\n",
"3                                         Berserk   8.59\n",
"4  Neon Genesis Evangelion: The End of Evangelion   8.55\n",
"5                         Cowboy Bebop: The Movie   8.38"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import duckdb\n",
"# from google.colab import data_table\n",
"# data_table.enable_dataframe_formatter()\n",
"\n",
"# a database '<pipeline_name>.duckdb' was created in the working directory, so just connect to it\n",
"conn = duckdb.connect(f\"{pipeline.pipeline_name}.duckdb\")\n",
"conn.sql(f\"SET search_path = '{pipeline.dataset_name}'\")\n",
"stats_table = conn.sql(\"SELECT title_english, score FROM anime ORDER BY score DESC LIMIT 6;\").df()\n",
"display(stats_table)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ECjfILYklDb6"
},
"source": [
"You may have noticed that we received only one page, i.e. only 25 records, but this endpoint has over 27k records in total (see below). To get all the pages you need to use pagination."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bOsD8Appbjy4"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "C1bgkFTQF2GO"
},
"source": [
"# Pagination\n",
"\n",
"Play with requests, check out what pagination parameters you can use."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "CWpsefP0GD_4"
},
"outputs": [
{
"data": {
"text/plain": [
"{'last_visible_page': 1099,\n",
" 'has_next_page': True,\n",
" 'current_page': 1,\n",
" 'items': {'count': 25, 'total': 27466, 'per_page': 25}}"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import requests\n",
"response = requests.get(\"https://api.jikan.moe/v4/anime\")\n",
"response.json()['pagination']"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "n5hb7QsqK6-a"
},
"source": [
"We can see that this endpoint has 27,466 items in total.\n",
"\n",
"Let's load only new anime (with status \"airing\") released in the last 3 months:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"id": "KZPNGsv8Joij"
},
"outputs": [],
"source": [
"params = {\n",
"    \"order_by\": \"start_date\",\n",
"    \"start_date\": \"2024-07-01\",\n",
"    \"status\": \"airing\",\n",
"    \"page\": 1\n",
"}\n",
"\n",
"response = requests.get(\"https://api.jikan.moe/v4/anime\", params=params)\n",
"# print(response.json()[\"pagination\"])\n",
"# print(response.json()[\"data\"][0])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Q9gObEHqZpB0"
},
"source": [
"# Task 2\n",
"- Add [pagination](https://www.getknit.dev/blog/api-pagination-best-practices) for the endpoint `anime`.\n",
"- Add incremental loading for the resource `anime`. Use:\n",
"  - `aired.from` as a cursor,\n",
"  - `airing` status,\n",
"  - `merge` write disposition,\n",
"  - `mal_id` as a primary key,\n",
"  - `dlt.sources.incremental`,\n",
"  - \"2024-07-01T00:00:00+00:00\" as `initial_value`.\n",
"\n",
"- Run the pipeline without errors.\n",
"\n",
"Read more about [incremental loading](https://dlthub.com/docs/general-usage/incremental-loading).\n",
"\n",
"\n",
"WARNING: This API has [Rate Limits](https://docs.api.jikan.moe/#section/Information/Rate-Limiting). Try to reduce the number of pages during development."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"id": "t7rLaW7B1lCM"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Pipeline anime_pipeline load step completed in 2.09 seconds\n",
"1 load package(s) were loaded to destination duckdb and into dataset anime_data_20241008105129\n",
"The duckdb destination used duckdb:////Users/adam/src/github.com/advincze/dlt-workshop/anime_pipeline.duckdb location to store data\n",
"Load package 1728427889.2651699 is LOADED and contains no failed jobs\n",
"Normalized data for the following tables:\n",
"- _dlt_pipeline_state: 1 row(s)\n",
"- anime: 134 row(s)\n",
"- anime__titles: 460 row(s)\n",
"- anime__producers: 357 row(s)\n",
"- anime__studios: 113 row(s)\n",
"- anime__genres: 279 row(s)\n",
"- anime__themes: 136 row(s)\n",
"- anime__demographics: 39 row(s)\n",
"- anime__title_synonyms: 102 row(s)\n",
"- anime__licensors: 6 row(s)\n",
"\n",
"Load package 1728427889.2651699 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs\n",
"Pipeline anime_pipeline load step completed in 1.03 seconds\n",
"1 load package(s) were loaded to destination duckdb and into dataset anime_data_20241008105129\n",
"The duckdb destination used duckdb:////Users/adam/src/github.com/advincze/dlt-workshop/anime_pipeline.duckdb location to store data\n",
"Load package 1728427896.7353642 is LOADED and contains no failed jobs\n",
"Normalized data for the following tables:\n",
"- _dlt_pipeline_state: 1 row(s)\n",
"\n",
"Load package 1728427896.7353642 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs\n"
]
}
],
"source": [
"import time\n",
"import dlt\n",
"import requests\n",
"\n",
"\n",
"@dlt.resource(\n",
"    table_name=\"anime\",\n",
"    write_disposition=\"merge\",\n",
"    primary_key=\"mal_id\"\n",
")\n",
"def get_anime(\n",
"    aired_from=dlt.sources.incremental(\"aired.from\", initial_value=\"2024-07-01T00:00:00+00:00\")\n",
"):\n",
"    params = {\n",
"        \"order_by\": \"start_date\",\n",
"        \"start_date\": aired_from.last_value[:10],\n",
"        \"status\": \"airing\",\n",
"        \"page\": 1\n",
"    }\n",
"    while True:\n",
"        page = get_with_backoff(\"https://api.jikan.moe/v4/anime\", params=params)\n",
"        if \"data\" in page:\n",
"            yield page[\"data\"]\n",
"        if not page[\"pagination\"][\"has_next_page\"]:\n",
"            break\n",
"        params[\"page\"] += 1\n",
"\n",
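{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `has_next_page` and `page` fields above are enough to walk all pages with plain `requests`. A minimal sketch (the `fetch_pages` helper and its `max_pages` cap are illustrative additions, not part of the API; the cap keeps us under the rate limits during development):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: request pages until has_next_page is False (capped for development)\n",
"def fetch_pages(url, params, max_pages=3):\n",
"    params = dict(params, page=1)\n",
"    while params[\"page\"] <= max_pages:\n",
"        page = requests.get(url, params=params).json()\n",
"        yield from page.get(\"data\", [])\n",
"        if not page.get(\"pagination\", {}).get(\"has_next_page\"):\n",
"            break\n",
"        params[\"page\"] += 1\n",
"\n",
"# rows = list(fetch_pages(\"https://api.jikan.moe/v4/anime\", params))"
]
},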
"def get_with_backoff(url, params, sleep=1):\n",
"    response = requests.get(url, params=params)\n",
"    # retry with exponential backoff while the API returns HTTP 429 (rate limited)\n",
"    if response.status_code == 429:\n",
"        time.sleep(sleep)\n",
"        return get_with_backoff(url, params, sleep=sleep * 2)\n",
"    return response.json()\n",
"\n",
"\n",
"pipeline = dlt.pipeline(\n",
"    pipeline_name='anime_pipeline',\n",
"    destination='duckdb',\n",
"    dataset_name='anime_data',\n",
"    dev_mode=True,\n",
")\n",
"\n",
"load_info = pipeline.run(get_anime())\n",
"print(load_info)\n",
"print(pipeline.last_trace.last_normalize_info)\n",
"assert_load_info(load_info)\n",
"\n",
"\n",
"load_info = pipeline.run(get_anime())\n",
"print(load_info)\n",
"print(pipeline.last_trace.last_normalize_info)\n",
"assert_load_info(load_info, expected_load_packages=1)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "j2yWGz58jTQX"
},
"source": [
"## Answer the Questions:\n",
"**2.1** How many rows does the `anime__genres` table have?\n",
"\n",
"**2.2** What anime has the highest score now?"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"id": "W6NDcXSAORBZ"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>title_english</th>\n",
"      <th>score</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>Bleach: Thousand-Year Blood War - The Conflict</td>\n",
"      <td>9.08</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>Monogatari Series: Off & Monster Season</td>\n",
"      <td>9.02</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>Re:ZERO -Starting Life in Another World- Season 3</td>\n",
"      <td>8.92</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>Dan Da Dan</td>\n",
"      <td>8.67</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"                                       title_english  score\n",
"0     Bleach: Thousand-Year Blood War - The Conflict   9.08\n",
"1            Monogatari Series: Off & Monster Season   9.02\n",
"2  Re:ZERO -Starting Life in Another World- Season 3   8.92\n",
"3                                         Dan Da Dan   8.67"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import duckdb\n",
"# from google.colab import data_table\n",
"# data_table.enable_dataframe_formatter()\n",
"\n",
"# a database '<pipeline_name>.duckdb' was created in the working directory, so just connect to it\n",
"conn = duckdb.connect(f\"{pipeline.pipeline_name}.duckdb\")\n",
"conn.sql(f\"SET search_path = '{pipeline.dataset_name}'\")\n",
"stats_table = conn.sql(\"SELECT title_english, score FROM anime ORDER BY score DESC LIMIT 4;\").df()\n",
"display(stats_table)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"┌──────────────┐\n",
"│ count_star() │\n",
"│    int64     │\n",
"├──────────────┤\n",
"│          248 │\n",
"└──────────────┘"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"conn = duckdb.connect(f\"{pipeline.pipeline_name}.duckdb\")\n",
"conn.sql(f\"SET search_path = '{pipeline.dataset_name}'\")\n",
"conn.sql(\"SELECT count(*) FROM anime__genres\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "diK6iTnA-gKz"
},
"source": [
"# Task 3: Get reviews with dlt.transformer\n",
"\n",
"- Use `@dlt.transformer` to get additional info for your data.\n",
"\n",
"- Use the endpoint `anime/{id}/reviews` to get reviews for the anime from Task 1 (one page, without pagination and incremental loading).\n",
"- Don't forget to handle rate-limit errors.\n",
"- Run the pipeline without errors.\n",
"\n",
"Read more about dlt [transformers](https://dlthub.com/docs/general-usage/resource#process-resources-with-dlttransformer)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uuJg2JVrUJNA"
},
"source": [
""
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"id": "2pyeJXUB7MbR"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fetch reviews 25\n",
"Pipeline anime_pipeline load step completed in 1.04 seconds\n",
"1 load package(s) were loaded to destination duckdb and into dataset anime_data_tr_20241008105138\n",
"The duckdb destination used duckdb:////Users/adam/src/github.com/advincze/dlt-workshop/anime_pipeline.duckdb location to store data\n",
"Load package 1728427898.4818618 is LOADED and contains no failed jobs\n",
"Normalized data for the following tables:\n",
"- _dlt_pipeline_state: 1 row(s)\n",
"- anime_reviews: 425 row(s)\n",
"- anime_reviews__tags: 436 row(s)\n",
"\n",
"Load package 1728427898.4818618 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs\n"
]
}
],
"source": [
"import time\n",
"import dlt\n",
"import requests\n",
"\n",
"@dlt.resource(table_name=\"anime\")\n",
"def get_anime():\n",
"    page = get_with_backoff(\"https://api.jikan.moe/v4/anime\")\n",
"    if \"data\" in page:\n",
"        yield page[\"data\"]\n",
"\n",
"@dlt.transformer(data_from=get_anime, table_name=\"anime_reviews\")\n",
"def get_reviews(items):\n",
"    print(\"fetch reviews\", len(items))\n",
"    for item in items:\n",
"        page = get_with_backoff(f\"https://api.jikan.moe/v4/anime/{item['mal_id']}/reviews\")\n",
"        if \"data\" in page:\n",
"            yield page[\"data\"]\n",
"\n",
"def get_with_backoff(url, params=None, sleep=1):\n",
"    response = requests.get(url, params=params)\n",
"    # retry with exponential backoff while the API returns HTTP 429 (rate limited)\n",
"    if response.status_code == 429:\n",
"        time.sleep(sleep)\n",
"        return get_with_backoff(url, params, sleep=sleep * 2)\n",
"    return response.json()\n",
"\n",
"pipeline = dlt.pipeline(\n",
"    pipeline_name='anime_pipeline',\n",
"    destination='duckdb',\n",
"    dataset_name='anime_data_tr',\n",
"    dev_mode=True,\n",
")\n",
"\n",
"data = get_anime | get_reviews\n",
"\n",
"load_info = pipeline.run(data())\n",
"print(load_info)\n",
"print(pipeline.last_trace.last_normalize_info)\n",
"assert_load_info(load_info)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qUll84tQi1Ij"
},
"source": [
"Run the code below and\n",
"\n",
"## Answer the Questions:\n",
"\n",
"**3.1** How many tables were generated by dlt during the normalisation step (see \"Normalized data for the following tables:\")? Including tables created by dlt (e.g. `_dlt_pipeline_state`).\n",
"\n",
"**3.2** As you can see below, Archaeon got the most overall reactions.\n",
"Which username took second place?"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"id": "lkJz8HsrOYtj"
},
"outputs": [
{
"data": {
"text/plain": [
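{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside: `get_with_backoff` waits for a guessed duration. A variant (a sketch, not part of the homework solution; `get_with_retry_after` and `max_sleep` are our own names) could honor the `Retry-After` header that HTTP 429 responses commonly carry, falling back to exponential backoff when the header is absent:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_with_retry_after(url, params=None, sleep=1, max_sleep=60):\n",
"    response = requests.get(url, params=params)\n",
"    if response.status_code == 429 and sleep <= max_sleep:\n",
"        # prefer the server-suggested wait, fall back to exponential backoff\n",
"        wait = float(response.headers.get(\"Retry-After\", sleep))\n",
"        time.sleep(wait)\n",
"        return get_with_retry_after(url, params, sleep=sleep * 2, max_sleep=max_sleep)\n",
"    response.raise_for_status()\n",
"    return response.json()"
]
},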
"┌────────────────┬───────────────────────────────────┬───────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬───────────┐\n", | |
"│ database │ schema │ name │ column_names │ column_types │ temporary │\n", | |
"│ varchar │ varchar │ varchar │ varchar[] │ varchar[] │ boolean │\n", | |
"├────────────────┼───────────────────────────────────┼───────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────────┤\n", | |
"│ anime_pipeline │ anime_data_20241008105126 │ _dlt_loads │ [load_id, schema_name, status, inserted_at, schema_version_hash] │ [VARCHAR, VARCHAR, BIGINT, TIMESTAMP WITH TIME ZONE, VARCHAR] │ false │\n", | |
"│ anime_pipeline │ anime_data_20241008105126 │ _dlt_pipeline_state │ [version, engine_version, pipeline_name, state, created_at, version_hash, _dlt_load_id, _dlt_id] │ [BIGINT, BIGINT, VARCHAR, VARCHAR, TIMESTAMP WITH TIME ZONE, VARCHAR, VARCHAR, VARCHAR] │ false │\n", | |
"│ anime_pipeline │ anime_data_20241008105126 │ _dlt_version │ [version, engine_version, inserted_at, schema_name, version_hash, schema] │ [BIGINT, BIGINT, TIMESTAMP WITH TIME ZONE, VARCHAR, VARCHAR, VARCHAR] │ false │\n", | |
"│ anime_pipeline │ anime_data_20241008105126 │ anime │ [mal_id, url, images__jpg__image_url, images__jpg__small_image_url, images__jpg__large_image_url, images__webp__image_url, images__webp__small_image_url, images__webp__large_image_url, trailer__youtube_id, trailer__url, trailer__embed_url, trailer__images__image_url, trailer__images__small_image_url, trailer__images__medium_image_url, trailer__images__large_image_url, trailer__images__maximum_image_url, approved, title, title_english, title_japanese, type, source, episodes, status, airing, aired__from, aired__to, aired__prop__from__day, aired__prop__from__month, aired__prop__from__year, aired__prop__to__day, aired__prop__to__month, aired__prop__to__year, aired__string, duration, rating, score, scored_by, rank, popularity, members, favorites, synopsis, background, season, year, broadcast__day, broadcast__time, broadcast__timezone, broadcast__string, _dlt_load_id, _dlt_id] │ [BIGINT, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, BOOLEAN, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, BIGINT, VARCHAR, BOOLEAN, TIMESTAMP WITH TIME ZONE, TIMESTAMP WITH TIME ZONE, BIGINT, BIGINT, BIGINT, BIGINT, BIGINT, BIGINT, VARCHAR, VARCHAR, VARCHAR, DOUBLE, BIGINT, BIGINT, BIGINT, BIGINT, BIGINT, VARCHAR, VARCHAR, VARCHAR, BIGINT, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR] │ false │\n", | |
"│ anime_pipeline │ anime_data_20241008105126 │ anime__demographics │ [mal_id, type, name, url, _dlt_parent_id, _dlt_list_idx, _dlt_id] │ [BIGINT, VARCHAR, VARCHAR, VARCHAR, VARCHAR, BIGINT, VARCHAR] │ false │\n", | |
"│ anime_pipeline │ anime_data_20241008105126 │ anime__genres │ [mal_id, type, name, url, _dlt_parent_id, _dlt_list_idx, _dlt_id] │ [BIGINT, VARCHAR, VARCHAR, VARCHAR, VARCHAR, BIGINT, VARCHAR] │ false │\n", | |
"│ anime_pipeline │ anime_data_20241008105126 │ anime__licensors │ [mal_id, type, name, url, _dlt_parent_id, _dlt_list_idx, _dlt_id] │ [BIGINT, VARCHAR, VARCHAR, VARCHAR, VARCHAR, BIGINT, VARCHAR] │ false │\n", | |
"│ anime_pipeline │ anime_data_20241008105126 │ anime__producers │ [mal_id, type, name, url, _dlt_parent_id, _dlt_list_idx, _dlt_id] │ [BIGINT, VARCHAR, VARCHAR, VARCHAR, VARCHAR, BIGINT, VARCHAR] │ false │\n", | |
"│ anime_pipeline │ anime_data_20241008105126 │ anime__studios │ [mal_id, type, name, url, _dlt_parent_id, _dlt_list_idx, _dlt_id] │ [BIGINT, VARCHAR, VARCHAR, VARCHAR, VARCHAR, BIGINT, VARCHAR] │ false │\n", | |
"│ anime_pipeline │ anime_data_20241008105126 │ anime__themes │ [mal_id, type, name, url, _dlt_parent_id, _dlt_list_idx, _dlt_id] │ [BIGINT, VARCHAR, VARCHAR, VARCHAR, VARCHAR, BIGINT, VARCHAR] │ false │\n", | |
"│ · │ · │ · │ · │ · │ · │\n", | |
"│ · │ · │ · │ · │ · │ · │\n", | |
"│ · │ · │ · │ · │ · │ · │\n", | |
"│ anime_pipeline │ anime_data_20241008105129_staging │ anime__producers │ [mal_id, type, name, url, _dlt_root_id, _dlt_parent_id, _dlt_list_idx, _dlt_id] │ [BIGINT, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, BIGINT, VARCHAR] │ false │\n", | |
"│ anime_pipeline │ anime_data_20241008105129_staging │ anime__studios │ [mal_id, type, name, url, _dlt_root_id, _dlt_parent_id, _dlt_list_idx, _dlt_id] │ [BIGINT, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, BIGINT, VARCHAR] │ false │\n", | |
"│ anime_pipeline │ anime_data_20241008105129_staging │ anime__themes │ [mal_id, type, name, url, _dlt_root_id, _dlt_parent_id, _dlt_list_idx, _dlt_id] │ [BIGINT, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, BIGINT, VARCHAR] │ false │\n", | |
"│ anime_pipeline │ anime_data_20241008105129_staging │ anime__title_synonyms │ [value, _dlt_root_id, _dlt_parent_id, _dlt_list_idx, _dlt_id] │ [VARCHAR, VARCHAR, VARCHAR, BIGINT, VARCHAR] │ false │\n", | |
"│ anime_pipeline │ anime_data_20241008105129_staging │ anime__titles │ [type, title, _dlt_root_id, _dlt_parent_id, _dlt_list_idx, _dlt_id] │ [VARCHAR, VARCHAR, VARCHAR, VARCHAR, BIGINT, VARCHAR] │ false │\n", | |
"│ anime_pipeline │ anime_data_tr_20241008105138 │ _dlt_loads │ [load_id, schema_name, status, inserted_at, schema_version_hash] │ [VARCHAR, VARCHAR, BIGINT, TIMESTAMP WITH TIME ZONE, VARCHAR] │ false │\n", | |
"│ anime_pipeline │ anime_data_tr_20241008105138 │ _dlt_pipeline_state │ [version, engine_version, pipeline_name, state, created_at, version_hash, _dlt_load_id, _dlt_id] │ [BIGINT, BIGINT, VARCHAR, VARCHAR, TIMESTAMP WITH TIME ZONE, VARCHAR, VARCHAR, VARCHAR] │ false │\n", | |
"│ anime_pipeline │ anime_data_tr_20241008105138 │ _dlt_version │ [version, engine_version, inserted_at, schema_name, version_hash, schema] │ [BIGINT, BIGINT, TIMESTAMP WITH TIME ZONE, VARCHAR, VARCHAR, VARCHAR] │ false │\n", | |
"│ anime_pipeline │ anime_data_tr_20241008105138 │ anime_reviews │ [mal_id, url, type, reactions__overall, reactions__nice, reactions__love_it, reactions__funny, reactions__confusing, reactions__informative, reactions__well_written, reactions__creative, date, review, score, is_spoiler, is_preliminary, user__url, user__username, user__images__jpg__image_url, user__images__webp__image_url, _dlt_load_id, _dlt_id] │ [BIGINT, VARCHAR, VARCHAR, BIGINT, BIGINT, BIGINT, BIGINT, BIGINT, BIGINT, BIGINT, BIGINT, TIMESTAMP WITH TIME ZONE, VARCHAR, BIGINT, BOOLEAN, BOOLEAN, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR] │ false │\n", | |
"│ anime_pipeline │ anime_data_tr_20241008105138 │ anime_reviews__tags │ [value, _dlt_parent_id, _dlt_list_idx, _dlt_id] │ [VARCHAR, VARCHAR, BIGINT, VARCHAR] │ false │\n", | |
"├────────────────┴───────────────────────────────────┴───────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────────┤\n", | |
"│ 41 rows (20 shown) 6 columns │\n", | |
"└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Index(['user__username', 'reactions__overall', 'review'], dtype='object')\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>user__username</th>\n", | |
" <th>reactions__overall</th>\n", | |
" <th>review</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>Archaeon</td>\n", | |
" <td>3338</td>\n", | |
" <td>Since so many people have requested this over ...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>TheLlama</td>\n", | |
" <td>2232</td>\n", | |
" <td>People who know me know that I'm not a fan of ...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>CitizenGeek</td>\n", | |
" <td>1960</td>\n", | |
" <td>Note: this review is spoiler free. Hugely expe...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>TheCriticsClub</td>\n", | |
" <td>1955</td>\n", | |
" <td>Monster plays out like a macabre game of cat a...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>Polyphemus</td>\n", | |
" <td>1602</td>\n", | |
" <td>Cowboy Bebop is one of those series that is ju...</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" user__username reactions__overall \\\n", | |
"0 Archaeon 3338 \n", | |
"1 TheLlama 2232 \n", | |
"2 CitizenGeek 1960 \n", | |
"3 TheCriticsClub 1955 \n", | |
"4 Polyphemus 1602 \n", | |
"\n", | |
" review \n", | |
"0 Since so many people have requested this over ... \n", | |
"1 People who know me know that I'm not a fan of ... \n", | |
"2 Note: this review is spoiler free. Hugely expe... \n", | |
"3 Monster plays out like a macabre game of cat a... \n", | |
"4 Cowboy Bebop is one of those series that is ju... " | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"import duckdb\n", | |
"# from google.colab import data_table\n", | |
"# data_table.enable_dataframe_formatter()\n", | |
"\n", | |
"# a database '<pipeline_name>.duckdb' was created in working directory so just connect to it\n", | |
"conn = duckdb.connect(f\"{pipeline.pipeline_name}.duckdb\")\n", | |
"conn.sql(f\"SET search_path = '{pipeline.dataset_name}'\")\n", | |
"# list all tables\n", | |
"display(conn.sql(\"DESCRIBE\"))\n", | |
"\n", | |
"stats_table = conn.sql(\"SELECT user__username, reactions__overall, review FROM anime_reviews ORDER BY reactions__overall DESC LIMIT 5;\").df()\n", | |
"print(stats_table.columns)\n", | |
"display(stats_table)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "sJY0iUQFxpB6" | |
}, | |
"source": [ | |
"# Bonus\n", | |
"There are no tasks, just a top 5 Anime according to\n", | |
" MyAnimeList database 😉" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": { | |
"id": "-3WBOUyhxtr8" | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Pipeline anime_pipeline load step completed in 1.07 seconds\n", | |
"1 load package(s) were loaded to destination duckdb and into dataset anime_data_20241008105204\n", | |
"The duckdb destination used duckdb:////Users/adam/src/github.com/advincze/dlt-workshop/anime_pipeline.duckdb location to store data\n", | |
"Load package 1728427924.295403 is LOADED and contains no failed jobs\n", | |
"Normalized data for the following tables:\n", | |
"- _dlt_pipeline_state: 1 row(s)\n", | |
"- top_anime: 25 row(s)\n", | |
"- top_anime__titles: 156 row(s)\n", | |
"- top_anime__title_synonyms: 50 row(s)\n", | |
"- top_anime__producers: 132 row(s)\n", | |
"- top_anime__licensors: 22 row(s)\n", | |
"- top_anime__studios: 27 row(s)\n", | |
"- top_anime__genres: 72 row(s)\n", | |
"- top_anime__demographics: 18 row(s)\n", | |
"- top_anime__themes: 53 row(s)\n", | |
"\n", | |
"Load package 1728427924.295403 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs\n" | |
] | |
} | |
], | |
"source": [ | |
"import time\n", | |
"\n", | |
"import dlt\n", | |
"import requests\n", | |
"\n", | |
"\n", | |
"@dlt.resource(table_name=\"top_anime\")\n", | |
"def get_top_anime():\n", | |
" response = requests.get(\"https://api.jikan.moe/v4/top/anime\")\n", | |
" yield response.json()[\"data\"]\n", | |
"\n", | |
"\n", | |
"pipeline = dlt.pipeline(\n", | |
" pipeline_name='anime_pipeline',\n", | |
" destination='duckdb',\n", | |
" dataset_name='anime_data',\n", | |
" dev_mode=True,\n", | |
")\n", | |
"\n", | |
"load_info = pipeline.run(get_top_anime())\n", | |
"print(load_info)\n", | |
"print(pipeline.last_trace.last_normalize_info)\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": { | |
"id": "VE5ST3KVxyer" | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>title_english</th>\n", | |
" <th>score</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>Frieren: Beyond Journey's End</td>\n", | |
" <td>9.33</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>Fullmetal Alchemist: Brotherhood</td>\n", | |
" <td>9.09</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>Bleach: Thousand-Year Blood War - The Conflict</td>\n", | |
" <td>9.08</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>Steins;Gate</td>\n", | |
" <td>9.07</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>Gintama Season 4</td>\n", | |
" <td>9.06</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" title_english score\n", | |
"0 Frieren: Beyond Journey's End 9.33\n", | |
"1 Fullmetal Alchemist: Brotherhood 9.09\n", | |
"2 Bleach: Thousand-Year Blood War - The Conflict 9.08\n", | |
"3 Steins;Gate 9.07\n", | |
"4 Gintama Season 4 9.06" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"import duckdb\n", | |
"# from google.colab import data_table\n", | |
"# data_table.enable_dataframe_formatter()\n", | |
"\n", | |
"# a database '<pipeline_name>.duckdb' was created in working directory so just connect to it\n", | |
"conn = duckdb.connect(f\"{pipeline.pipeline_name}.duckdb\")\n", | |
"conn.sql(f\"SET search_path = '{pipeline.dataset_name}'\")\n", | |
"stats_table = conn.sql(\"SELECT title_english, score FROM top_anime ORDER BY score DESC LIMIT 5;\").df()\n", | |
"display(stats_table)" | |
] | |
} | |
], | |
"metadata": { | |
"colab": { | |
"include_colab_link": true, | |
"provenance": [], | |
"toc_visible": true | |
}, | |
"kernelspec": { | |
"display_name": ".venv", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.12.5" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |