Last active
July 19, 2023 02:46
-
-
Save lsloan/b6e481c2d045ee6d1b768b04f8fa9a34 to your computer and use it in GitHub Desktop.
umich-udp-bigquery-to-pandas-dataframe.ipynb
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"provenance": [], | |
"authorship_tag": "ABX9TyP8PIE5O2Mkb+bYHkgl3yA9", | |
"include_colab_link": true | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
}, | |
"language_info": { | |
"name": "python" | |
} | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/lsloan/b6e481c2d045ee6d1b768b04f8fa9a34/umich-udp-bigquery-to-pandas-dataframe.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"# UMich UDP BigQuery to Pandas DataFrame\n", | |
"\n", | |
"Based on code snippets in Google Colab, these examples show how to get data directly from UMich's UDP instance in Google BigQuery into Pandas DataFrames." | |
], | |
"metadata": { | |
"id": "GTCG0CG7q6eI" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"from google.colab import auth\n", | |
"auth.authenticate_user()" | |
], | |
"metadata": { | |
"id": "5JFB6Lz6PVKT" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"cloudProjectId = 'umich-udp-exploration'\n", | |
"bigqueryTableId = 'lecture_capture_exploration.640658_export'\n", | |
"sql = f'SELECT * FROM `{bigqueryTableId}`'" | |
], | |
"metadata": { | |
"id": "bl-phOllco5J" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## Pandas API\n", | |
"\n", | |
"Using the Google BigQuery (`gbq`) module from Pandas requires very little code. One line in this example. It lacks some of Cloud API's flashy features, but it gets the job done. It may not allow defining default column types, but there might be a way to specify the `project_id` once in the environment. Fortunately, it supports the `tqdm` progress bar using the `progress_bar_type='tqdm_notebook'` argument.\n", | |
"\n", | |
"See: [`read_gbq()` documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_gbq.html)" | |
], | |
"metadata": { | |
"id": "FitC1Ot2pr_n" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"import pandas as pd\n", | |
"\n", | |
"df = pd.read_gbq(sql, project_id=cloudProjectId, progress_bar_type='tqdm_notebook')\n", | |
"display(df.head())" | |
], | |
"metadata": { | |
"id": "YPvTe3c7chnd" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## Cloud API\n", | |
"\n", | |
"Using Google's Cloud API module for Python, there are a few advantages, but it requires a little more code. For example…\n", | |
"* The project ID may be given once when constructing the client object, then each subsequent query will not need to be told which project to use.\n", | |
"* `to_dataframe()` lets Python types be defined for each of various BigQuery types.\n", | |
"* `progress_bar_type='tqdm_notebook'` can be set to get a progress bar while the data is being loaded (and while the query is running), which is helpful with large datasets.\n", | |
"\n", | |
"See…\n", | |
"* [Download query results to DataFrame](https://cloud.google.com/bigquery/docs/samples/bigquery-query-results-dataframe) BigQuery code sample\n", | |
"* [`to_dataframe()` documentation](https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.QueryJob#google_cloud_bigquery_job_QueryJob_to_dataframe)" | |
], | |
"metadata": { | |
"id": "GkXTFPMHoAx6" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"from google.cloud import bigquery\n", | |
"\n", | |
"client = bigquery.Client(project=cloudProjectId)\n", | |
"df0 = client.query(sql).to_dataframe(progress_bar_type='tqdm_notebook')\n", | |
"display(df0.head())" | |
], | |
"metadata": { | |
"id": "TwdbSbHyPVKU" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
} | |
] | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment