camriddell · February 28, 2025 14:22
diff --git a/dutc-smarter-pandas-2025-02-28.ipynb b/dutc-smarter-pandas-2025-02-28.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5c336681",
   "metadata": {},
   "source": [
    "# Smarter pandas for Data Science\n",
    "\n",
    "## A Seminar by ‘Don’t Use This Code’\n",
    "\n",
    "**Presenter**: Cameron Riddell <[email protected]>\n",
    "\n",
    "<div style=\"display: flex; justify-content: center; font-size: 2em; width: auto; padding: .25em 5em .25em 5em;\">\n",
    "    <p style=\"text-align: center\">\n",
    "        Join us on <a href=\"https://discord.gg/ZhJPKYSfNp\">Discord (https://discord.gg/ZhJPKYSfNp)</a> for discussion and guidance!\n",
    "    </p>\n",
    "</div>\n",
    "\n",
    "## Contents\n",
    "\n",
    "1. [A Seminar by ‘Don’t Use This Code’](#a-seminar-by-‘don’t-use-this-code’)\n",
    "2. [Book a Class!](#book-a-class!)\n",
    "3. [About](#about)\n",
    "   * [Don’t Use This Code; Training & Consulting](#don’t-use-this-code;-training-&-consulting)\n",
    "\n",
    "## Book a Class!\n",
    "\n",
    "<big><big>Book a class or training for your team!</big></big>\n",
    "\n",
    "Please reach out to us at [[email protected]](mailto:[email protected]) if are\n",
    "interested in bringing this material, or any of our other material, to your\n",
    "team.\n",
    "\n",
    "We have courses on topics such as:\n",
    "- intro Python\n",
    "- expert Python\n",
    "- data engineering with Python\n",
    "- data science and scientific computing with `numpy`, `pandas`, and `xarray`\n",
    "\n",
    "If you reach out to us, we can also provide a printable copy of the notes,\n",
    "cleaned-up and in .pdf format, as well as a professionally edited video\n",
    "recording of this presentation.\n",
    "\n",
    "## About\n",
    "\n",
    "### Don’t Use This Code; Training & Consulting\n",
    "\n",
    "Don’t Use This Code is a professional training, coaching, and consulting\n",
    "company. We are deeply invested in the open source scientific computing\n",
    "community, and are dedicated to bringing better processes, better tools, and\n",
    "better understanding to the world.\n",
    "\n",
    "**Don’t Use This Code is growing! We are currently seeking new partners, new\n",
    "clients, and new engagements for our expert consulting and training\n",
    "services.**\n",
    "\n",
    "Our ideal client is an organization, large or small, using open source\n",
    "technologies, centering around the PyData stack for scientififc and numeric\n",
    "computing. Organizations looking to better employ these tools would benefit\n",
    "from our wide range of training courses on offer, ranging from an intensive\n",
    "introduction to Python fundamentals to advanced applications of Python for\n",
    "building large-scale, production systems. Working with your team, we can craft\n",
    "targeted curricula to meet your training goals. We are also available for\n",
    "consulting services such as building scientific computing and numerical\n",
    "analysis systems using technologies like Python and React.\n",
    "\n",
    "We pride ourselves on delivering top-notch training. We are committed to\n",
    "providing quality training that is uniquely valuable to each individual\n",
    "attendee, and we do so by investing in three key areas: our\n",
    "content, our processes, and our contributors.\n",
    "\n",
    "## Notes\n",
    "\n",
    "### Topics Covered"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f3c9505",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Let's Get Started!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f43fb50b",
   "metadata": {},
   "source": [
    "**Why use pandas**\n",
    "- Restricted Computation Domain\n",
    "    - compare Python list, numpy.ndarray, pandas.Series\n",
    "- What is Index alignment and why its important.\n",
    "\n",
    "**Simple analysis**\n",
    "- Loading data from files & in-memory\n",
    "- indexing/slicing/filtering\n",
    "- simple data exploration\n",
    "    - aggregations\n",
    "- simple plots\n",
    "\n",
    "**Data Cleaning**\n",
    "- column name normalization\n",
    "- mixed types: casting to specific dtypes\n",
    "- missing data\n",
    "- duplicates\n",
    "- unnecessary columns/rows\n",
    "- entity normalization\n",
    "\n",
    "**Grouped Operations**\n",
    "- apply, agg, transform\n",
    "    - use specific verbs (agg/transform > apply)\n",
    "    - avoid 1 function call per group\n",
    "\n",
    "**Reshaping Data**\n",
    "- wide vs long\n",
    "- melt         → wide to long\n",
    "- pivot        → long to wide\n",
    "- pivot_table  → long to wide w/ agg\n",
    "- stack        → wide to long (via the index)\n",
    "- unstack      → long to wide (via the index)\n",
    "- lreshape     → generic wide to long\n",
    "- wide_to_long → patterned widt to long\n",
    "\n",
    "## Why Use Pandas at all?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1bbb063",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Let's Get Started!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52673084",
   "metadata": {},
   "source": [
    "**Can’t we analyze tabular data in Python?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "65dcb167",
   "metadata": {},
   "outputs": [],
   "source": [
    "from io import StringIO\n",
    "\n",
    "buffer = StringIO('''\n",
    "product,quantity,price\n",
    "apple,10,0.5\n",
    "banana,5,0.3\n",
    "apple,4,0.5\n",
    "banana,7,0.3\n",
    "orange,8,0.7\n",
    "'''.lstrip())\n",
    "\n",
    "# ① What product has the highest revenue (quantity * price)\n",
    "# ② How often does each product appear?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7713a21d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from io import StringIO\n",
    "\n",
    "buffer = StringIO('''\n",
    "product,quantity,price\n",
    "apple,10,0.5\n",
    "banana,5,0.3\n",
    "apple,4,0.5\n",
    "banana,7,0.3\n",
    "orange,8,0.7\n",
    "'''.lstrip())\n",
    "\n",
    "# Now using pandas…\n",
    "# ① What product has the highest revenue (quantity * price)\n",
    "# ② How often does each product appear?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6590624e",
   "metadata": {},
   "source": [
    "### Restricted Computation Domains\n",
    "\n",
    "*Array vs List Memory*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "72e9acb9",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sys import getsizeof\n",
    "from pandas import Series\n",
    "\n",
    "class T:\n",
    "    pass\n",
    "\n",
    "x = [1   for _ in range(10_000)]\n",
    "# x = [T() for _ in range(10_000)]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "581b8b45",
   "metadata": {},
   "source": [
    "*Array vs List Growing*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f0441f55",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "41a54127",
   "metadata": {},
   "source": [
    "*Array vs List Speed*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34e1ecea",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "1a8a7b18",
   "metadata": {},
   "source": [
    "**How Does NumPy Do This?**\n",
    "\n",
    "```zsh\n",
    "# credit: Jake Vanderplas\n",
    "# source: https://jakevdp.github.io/PythonDataScienceHandbook/figures/array_vs_list.png\n",
    "\n",
    "eog images/jake_vanderplas_numpy_list.png\n",
    "```\n",
    "\n",
    "### Data Types (preview)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "527951ba",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import Categorical, DataFrame, date_range, timedelta_range\n",
    "\n",
    "class T:\n",
    "    pass\n",
    "\n",
    "df = DataFrame({\n",
    "    'bool':         [True, False, True],\n",
    "    'int':          [1, 2, 3],\n",
    "    'float':        [1.1, 2.2, 3.3],\n",
    "    'datetime':     date_range('2000-01-01', periods=3, freq='d'),\n",
    "    'timedelta':    timedelta_range('1 day', periods=3),\n",
    "    'string':       ['a', 'b', 'c'],\n",
    "    'categorical':  Categorical(['a', 'b', 'c']),\n",
    "    'object':       ['a', 1, T()],\n",
    "}).astype({'string': 'string'})\n",
    "\n",
    "print(df)\n",
    "print(df.dtypes)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "01b675da",
   "metadata": {},
   "source": [
    "- bool: stores True/False values.\n",
    "- int: stores whole numbers with optional signed/unsigned and bit-length variations.\n",
    "- float: stores decimal numbers with IEEE 754 double-precision floating-point representation.\n",
    "- datetime: Represents timestamps with nanosecond precision, supporting time zone awareness.\n",
    "- timedelta: Captures differences between timestamps, useful for time-based calculations.\n",
    "- string: Optimized string storage with variable-length UTF-8 encoding for efficient memory usage.\n",
    "- categorical: Stores repeated string-like values as integer codes mapped to categories, reducing memory usage.\n",
    "- object: Generic type storing arbitrary Python objects, often inefficient for large datasets.\n",
    "\n",
    "### Core Pieces\n",
    "\n",
    "A Restricted Computation Domain is an \"area\" where one\n",
    "can perform *fast* computations, only if one abides by the rules\n",
    "of that space.\n",
    "\n",
    "*pandas.Series*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ddb2c8c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import Series\n",
    "\n",
    "# ① construction\n",
    "# ② .dtype\n",
    "# ③ .index\n",
    "# ④ .loc\n",
    "# ⑤ .iloc\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9b7d1d5",
   "metadata": {},
   "source": [
    "*pandas.DataFrame*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6385619",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame\n",
    "\n",
    "# ① construction\n",
    "# ② .dtypes\n",
    "# ③ .index\n",
    "# ④ .columns\n",
    "# ⑤ .loc\n",
    "# ⑥ .iloc"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f8e1f44",
   "metadata": {},
   "source": [
    "### Rules of Domain Operations\n",
    "\n",
    "*NumPy Broadcasting*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d776915d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy import array\n",
    "\n",
    "xs = array([ 1,  2,  3,  4])\n",
    "ys = array([10, 20, 30, 40])\n",
    "\n",
    "# ① array vs scalar\n",
    "# ② array vs array"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f3c154c",
   "metadata": {},
   "source": [
    "*pandas Index Alignment*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4baa51fa",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import Series\n",
    "\n",
    "s1 = Series([ 1,  2,  3,  4], index=['a', 'b', 'c', 'd'          ])\n",
    "s2 = Series([10, 20, 30, 40], index=[     'b',      'd', 'e', 'f'])\n",
    "\n",
    "# ① adding 2 series\n",
    "# ② .reindex\n",
    "# ③ scalar broadcast (equality check)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d55a59d",
   "metadata": {},
   "source": [
    "## Simple Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "73bdd11f",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Let's Get Started!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bcbf58b8",
   "metadata": {},
   "source": [
    "### Loading Data\n",
    "\n",
    "#### DataFrames IO"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a65b7434",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas\n",
    "\n",
    "for fname in dir(pandas):\n",
    "    if not fname.startswith('read_'):\n",
    "        continue\n",
    "    print(fname)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9fab66af",
   "metadata": {},
   "outputs": [],
   "source": [
    "from io import StringIO\n",
    "from textwrap import dedent\n",
    "\n",
    "buffer = StringIO(dedent('''\n",
    "    col1,col2\n",
    "    1,2\n",
    "    3,4\n",
    "''').strip())\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "681a5d0e",
   "metadata": {},
   "source": [
    "**text vs binary based formats**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f240ff5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame, date_range, Categorical, read_csv, read_parquet\n",
    "from io import StringIO, BytesIO\n",
    "from contextlib import contextmanager\n",
    "\n",
    "df = DataFrame({\n",
    "\t'date': date_range('2000-01-01', periods=4, freq='ME'),\n",
    "\t'ticker': Categorical(['abcd', 'efgh', 'ijkl', 'mnop']),\n",
    "\t'value': [100, 200, 300, 400],\n",
    "}).apply(lambda s: s.repeat(10_000))\n",
    "\n",
    "df.to_csv('data/test.csv')\n",
    "df.to_parquet('data/test.parquet')\n",
    "\n",
    "# ① Round trip\n",
    "# ② On-disk size"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f11803b",
   "metadata": {},
   "source": [
    "#### Comon DataFrames Constructors\n",
    "\n",
    "from_dict\n",
    "from_records\n",
    "json_normalize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc016f06",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame\n",
    "\n",
    "# df = DataFrame({'a': [1,2,3]})\n",
    "# df = DataFrame([[1],[2],[3]], columns=['a'])\n",
    "# df = DataFrame([{'a': 1}, {'a': 2}, {'a': 3}])\n",
    "\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f8229d3",
   "metadata": {},
   "source": [
    "*json vs records — nested data*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d052edc",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = [\n",
    "  {\n",
    "    \"order_id\": 1,\n",
    "    \"customer\": {\n",
    "      \"name\": \"Alice\",\n",
    "      \"email\": \"[email protected]\"\n",
    "    },\n",
    "    \"items\": [\n",
    "      {\"product\": \"Laptop\", \"price\": 999.99, \"quantity\": 1},\n",
    "      {\"product\": \"Mouse\", \"price\": 19.99, \"quantity\": 2}\n",
    "    ],\n",
    "    \"shipping_address\": {\n",
    "      \"street\": \"123 Elm St\",\n",
    "      \"city\": \"Metropolis\"\n",
    "    }\n",
    "  },\n",
    "  {\n",
    "    \"order_id\": 2,\n",
    "    \"customer\": {\n",
    "      \"name\": \"Bob\",\n",
    "      \"email\": \"[email protected]\"\n",
    "    },\n",
    "    \"items\": [\n",
    "      {\"product\": \"Keyboard\", \"price\": 49.99, \"quantity\": 1}\n",
    "    ],\n",
    "    \"shipping_address\": {\n",
    "      \"street\": \"456 Oak St\",\n",
    "      \"city\": \"Smallville\"\n",
    "    }\n",
    "  }\n",
    "]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08b9a3d1",
   "metadata": {},
   "source": [
    "#### Selecting Columns & Rows\n",
    "\n",
    "**projection**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3eddc66b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame\n",
    "\n",
    "df = DataFrame({\n",
    "    'a': [1, 2, 3],\n",
    "    'b': [4, 5, 6],\n",
    "    'c': [*'xyz'],\n",
    "})\n",
    "\n",
    "# ① .__getitem__ → scalar, list, boolean\n",
    "# ② .filter → items, like, regex\n",
    "# ③ .select_dtypes(…)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47d3d269",
   "metadata": {},
   "source": [
    "**filtering**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7903d5f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from itertools import chain, repeat\n",
    "from pandas import DataFrame\n",
    "\n",
    "df = DataFrame({\n",
    "    'group': [*chain(repeat('a', 5), repeat('b', 5))],\n",
    "    'value': [*range(10)],\n",
    "})\n",
    "\n",
    "# ① .loc → scalar, slice, list, boolean, callable\n",
    "# ② .iloc → scalar, slice, list\n",
    "# ③ .drop(…) → remove rows by index\n",
    "# ④ .query → convenience syntax"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "550e4b03",
   "metadata": {},
   "source": [
    "### Views, Copies, & SettingWithCopyWarning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a1bafb4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame\n",
    "\n",
    "df = DataFrame({'A': [1, 1, 1, 2, 2]}, columns=['A'])\n",
    "subset = df.loc[0:3]\n",
    "subset['A'] = 'a'\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b51263ac",
   "metadata": {},
   "source": [
    "### “Feeling” Your Data\n",
    "\n",
    "sample\n",
    "info\n",
    "describe\n",
    "sort_values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7fb65fdf",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "245bc21c",
   "metadata": {},
   "source": [
    "### Aggregation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24850634",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "483e43c4",
   "metadata": {},
   "source": [
    "### Hows The Weather?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "04eb82e7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from io import StringIO\n",
    "\n",
    "buffer = StringIO('''\n",
    "      date  temperature  humidity  precipitation\n",
    "2020-01-01           73       215           True\n",
    "2020-01-02           77       217           True\n",
    "2020-01-03           74       218           True\n",
    "2020-01-04           NA       219          False\n",
    "2020-01-05           78       220          False\n",
    "2020-01-08           72       221           True\n",
    "2020-01-11           90       209           True\n",
    "2020-01-18           85       231           True\n",
    "2020-02-01           84       211           True\n",
    "'''.lstrip())\n",
    "\n",
    "# What is the weather on January 2nd?\n",
    "# How many days did it rain? How many days did it not rain?\n",
    "# What days did it rain?\n",
    "# What days were hotter than the preceding day?\n",
    "# What was the hottest day when it rained?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7abfa6ca",
   "metadata": {},
   "source": [
    "*Plot the temperature and precipitation*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d21d401d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from io import StringIO\n",
    "\n",
    "buffer = StringIO('''\n",
    "      date  temperature  humidity  precipitation\n",
    "2020-01-01           73       215           True\n",
    "2020-01-02           77       217           True\n",
    "2020-01-03           74       218           True\n",
    "2020-01-04           NA       219          False\n",
    "2020-01-05           78       220          False\n",
    "2020-01-08           72       221           True\n",
    "2020-01-11           90       209           True\n",
    "2020-01-18           85       231           True\n",
    "2020-02-01           84       211           True\n",
    "'''.lstrip())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f330442b",
   "metadata": {},
   "source": [
    "Further Reading\n",
    "- [Don’t Forget About the Index](https://www.dontusethiscode.com/blog/2023-07-05_pandas_index.html)\n",
    "- [Python Sets & the pandas Index](https://www.dontusethiscode.com/blog/2024-03-06_indexes_and_sets.html)\n",
    "- [Good pandas needs good Python](https://www.dontusethiscode.com/blog/2024-01-24_pandas_needs_python.html)\n",
    "\n",
    "## Preparing for Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa4c9d53",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Let's Get Started!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d9607674",
   "metadata": {},
   "source": [
    "Like many things in life, data work also follows the 80/20 rule. You spend…\n",
    "- 40% of your time cleaning data\n",
    "- 20% analyzing\n",
    "- another 40% cleaning it better than you did the first time.\n",
    "\n",
    "### Practice Problem\n",
    "\n",
    "who made this?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7bfe610",
   "metadata": {},
   "outputs": [],
   "source": [
    "from io import StringIO\n",
    "\n",
    "buffer = StringIO('''\n",
    "customer_id,date,amount,country\n",
    "001,2024/01/01,$100.50,US\n",
    "002,2024-01-05,$50.25,UK\n",
    ",2024-01-08,,$50.75\n",
    "004,2024-01-11,$75,Canada\n",
    "005,,,$0.00\n",
    "006,2024-01-14,$30.10,UK\n",
    "007,2024-01-16,$45.00,US\n",
    "008,2024-01-20,$60.75,\n",
    "''')\n",
    "\n",
    "# ① Identify issues with this dataset\n",
    "# ② Load & clean"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3ad7e395",
   "metadata": {},
   "source": [
    "comma separated, but not tabular"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46cdbe53",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import read_csv, DataFrame\n",
    "from io import StringIO\n",
    "from textwrap import dedent\n",
    "\n",
    "buffer = StringIO(dedent('''\n",
    "    device,upgrade_dates\n",
    "    device-1,2000-01-01,2000-02-01,2000-03-01\n",
    "    device-2,2000-01-01,2000-04-01\n",
    "    device-3,2000-01-01,2000-03-01,2000-05-01,2000-10-01\n",
    "    device-4,2000-01-01,2000-07-01,2000-09-01\n",
    "''').strip())\n",
    "\n",
    "# ① Load with 2 columns, where upgrade_dates is a column of lists\n",
    "# ② Load with 2 columns, where upgrade_dates is \"exploded\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1281ba6d",
   "metadata": {},
   "source": [
    "### Help! My Data Are Missing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8237b2a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import Series\n",
    "\n",
    "s = Series([1, 2, None, 4, None, 6, None])\n",
    "\n",
    "# ① scalar fill\n",
    "# ② forward fill\n",
    "# ③ backward fill\n",
    "# ④ interpolation\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb900566",
   "metadata": {},
   "source": [
    "### More Dtypes & Dtype Backends\n",
    "\n",
    "Lets dive into these features starting with a closer look at a couple of\n",
    "data types.\n",
    "\n",
    "**Datetimes**\n",
    "\n",
    "- Datetime\n",
    "- Timedelta\n",
    "- Period"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb336693",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame, date_range\n",
    "\n",
    "df = DataFrame({\n",
    "    'date': date_range('2000-01-01', periods=5, freq='min'),\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de34f5ce",
   "metadata": {},
   "source": [
    "*useful datetime index*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a00591f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import Series, date_range\n",
    "from numpy import arange\n",
    "\n",
    "s = Series(\n",
    "    index=date_range('2023-01-05', periods=100, freq='5h'),\n",
    "    data=arange(100),\n",
    "    name='signal 1',\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30362c78",
   "metadata": {},
   "source": [
    "*Resampling*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8d4daed",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import Series, to_datetime, to_timedelta\n",
    "from numpy.random import default_rng\n",
    "\n",
    "rng = default_rng(0)\n",
    "\n",
    "s = Series(\n",
    "    100 * rng.normal(1, scale=.001, size=(size := 5_000)).cumprod(),\n",
    "    index=(\n",
    "        to_datetime('2023-02-15')\n",
    "        + to_timedelta(rng.integers(100, 1_000, size=size).cumsum(), unit='s')\n",
    "    )\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02d6bc16",
   "metadata": {},
   "source": [
    "**Categorical**\n",
    "\n",
    "Issue: strings take up more memory than integers do."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d8d099c2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import Series\n",
    "\n",
    "s = Series(['a', 'b', 'c', 'a', 'a', 'b'])\n",
    "\n",
    "# ① memory footprint of strings vs integers\n",
    "# ② factorize with hashmap\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ce47284",
   "metadata": {},
   "source": [
    "### Dtype Backends\n",
    "\n",
    "A new(er) feature of pandas is to have different array backends for in-memory\n",
    "data storage.\n",
    "\n",
    "NumPy vs Numpy[nullable] vs Pyarrow"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ae5b65e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame, concat\n",
    "\n",
    "df = DataFrame({\n",
    "    'int':   [1  , 2  , 3  ],\n",
    "    'float': [1.1, 2.2, 3.3],\n",
    "})\n",
    "\n",
    "orig = concat(\n",
    "    [\n",
    "        df.add_prefix('numpy_'),\n",
    "        *(\n",
    "            df.convert_dtypes(dtype_backend=backend)\n",
    "            .add_prefix(f'{backend}_')\n",
    "            for backend in ['numpy_nullable', 'pyarrow']\n",
    "        )\n",
    "    ],\n",
    "    axis=1\n",
    ")\n",
    "\n",
    "df = orig.copy()\n",
    "df.loc[0, 'numpy_int'] = None\n",
    "df.loc[0, 'numpy_nullable_int'] = None\n",
    "df.loc[0, 'pyarrow_int'] = None\n",
    "\n",
    "print(\n",
    "    orig,\n",
    "    df,\n",
    "    DataFrame({\n",
    "        'orig': orig.dtypes,\n",
    "        'new' : df.dtypes,\n",
    "    }),\n",
    "    sep='\\n{}\\n'.format('\\N{box drawings light horizontal}' * 80)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60b52a57",
   "metadata": {},
   "source": [
    "**When Does This Really Make a Difference?**\n",
    "\n",
    "Working with strings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f8f25d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from random import Random\n",
    "from string import ascii_lowercase\n",
    "from pandas import Series\n",
    "from _lib import timed\n",
    "\n",
    "rnd = Random(0)\n",
    "chars = [\n",
    "    ''.join(rnd.choices(\n",
    "        [*ascii_lowercase], k=rnd.randint(2, 10)\n",
    "    ))\n",
    "    for _ in range(1_000_000)\n",
    "]\n",
    "\n",
    "s = Series(chars)\n",
    "s2 = s.astype('string[pyarrow]')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0b85c972",
   "metadata": {},
   "source": [
    "Further Reading\n",
    "- [Dealing with dates in pandas (part 1)](https://www.dontusethiscode.com/blog/2023-02-01_datetimes-pandas.html)\n",
    "- [Dealing with dates in pandas (part 2)](https://www.dontusethiscode.com/blog/2023-02-08_datetimes-pandas-2.html)\n",
    "- [Dealing with dates in pandas (part 3)](https://www.dontusethiscode.com/blog/2023-02-15_datetimes-pandas-3.html)\n",
    "- [Using a Categorical to Represent Tag Sets](https://www.dontusethiscode.com/blog/2024-06-19_enum-categoricals.html)\n",
    "- [pandas: Months, Days, and Categoricals](https://www.dontusethiscode.com/blog/2024-08-28_months_days_categoricals.html)\n",
    "- [pandas/Polars NaNs vs Null](https://www.dontusethiscode.com/blog/2025-01-08_null_vs_nan.html)\n",
    "- [More NaNs & Nulls in pandas/Polars](https://www.dontusethiscode.com/blog/2025-01-22_null_vs_nan_applied.html)\n",
    "\n",
    "# Group by"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2361575f",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Let's Get Started!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c82d65be",
   "metadata": {},
   "source": [
    "*modalities: apply, agg, transform*\n",
    "but what about filter?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d834c28",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame, Index\n",
    "from numpy import repeat\n",
    "from numpy.random import default_rng\n",
    "\n",
    "rng = default_rng(0)\n",
    "\n",
    "df = DataFrame(\n",
    "    data={\n",
    "        'group':   repeat([\"a\", \"b\", \"c\"], 3),\n",
    "        'values1': [34, 25, 20, 10, 12, 1, 3, 0, 7],\n",
    "        'values2': [39, 41, 15, 20, 25, 31, 10, 9, 4],\n",
    "    }\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b0a4bda9",
   "metadata": {},
   "source": [
    "### What’s Fastest?\n",
    "\n",
    "Grouped, weighted means"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "954b9684",
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy.random import default_rng\n",
    "from pandas import DataFrame\n",
    "\n",
    "rng = default_rng(0)\n",
    "df = DataFrame({\n",
    "    'groups' : rng.choice([*'ABCDEF'], size=(size := 10_000), replace=True),\n",
    "    'weights': rng.uniform(0, 2,   size=(size := 10_000)),\n",
    "    'values' : rng.normal(100, 10, size=size),\n",
    "})\n",
    "\n",
    "# ① everything in groupby.apply\n",
    "# ② temp column, then groupby apply\n",
    "# ③ temp column, groupby agg (no user-defined function)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "184ea333",
   "metadata": {},
   "source": [
    "Further Reading\n",
    "- [Split → ??? → Combine](https://www.dontusethiscode.com/blog/2022-09-21_groupby-split-apply-combine.html)\n",
    "- [pandas & Polars: group by & window functions](https://www.dontusethiscode.com/blog/2024-06-05_dataframe-window.html)\n",
    "\n",
    "## Reshaping Data in pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f9720bfb",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Let's Get Started!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36b64529",
   "metadata": {},
   "source": [
    "### Data Shapes\n",
    "\n",
    "Wide vs Long"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a00d4c07",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame, NA\n",
    "\n",
    "wide_df = DataFrame({\n",
    "    'product': [*'ABCD'],\n",
    "    'jan'    : [NA, 11, 12, 13],\n",
    "    'feb'    : [14, 15, 16, 17],\n",
    "    'mar'    : [18, 19, NA, NA],\n",
    "}).convert_dtypes('numpy_nullable')\n",
    "\n",
    "long_df = (\n",
    "    DataFrame({\n",
    "        'product': [*'BCDABCDAB'],\n",
    "        'month'  : [*'jjjffffmm'],\n",
    "        'sales'  : [*range(11, 11+9)],\n",
    "    })\n",
    "    .assign( # replace 'j' → 'jan', etc.\n",
    "        month=lambda d:\n",
    "            d['month'].map({'j': 'jan', 'f': 'feb', 'm': 'mar'})\n",
    "    )\n",
    ")\n",
    "\n",
    "print(\n",
    "    wide_df,\n",
    "    long_df,\n",
    "    sep='\\n\\n',\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82eee924",
   "metadata": {},
   "source": [
    "### Standard Reshaping Functions\n",
    "\n",
    "- melt        → wide to long\n",
    "- pivot       → long to wide\n",
    "- pivot_table → long to wide w/ agg\n",
    "- stack       → wide to long\n",
    "- unstack     → long to wide\n",
    "\n",
    "**wide → long**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e53ca93f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame, NA\n",
    "\n",
    "wide_df = DataFrame({\n",
    "    'product': [*'ABCD'],\n",
    "    'jan'    : [NA, 11, 12, 13],\n",
    "    'feb'    : [14, 15, 16, 17],\n",
    "    'mar'    : [18, 19, NA, NA],\n",
    "}).convert_dtypes()\n",
    "\n",
    "# ① melt\n",
    "# ② stack"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b484d87c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame, NA\n",
    "\n",
    "wide_df = DataFrame({\n",
    "    'product': [*'ABCD'],\n",
    "    'jan_sales'  : [NA, 11, 12, 13],\n",
    "    'jan_units'  : [NA,  1,  2,  3],\n",
    "    'jan_clicks' : [NA, 71, 72, 73],\n",
    "    'feb_sales'  : [14, 15, 16, 17],\n",
    "    'feb_units'  : [ 4,  5,  6,  7],\n",
    "    'feb_clicks' : [74, 75,  6, 77],\n",
    "    'mar_sales'  : [18, 19, NA, NA],\n",
    "    'mar_units'  : [8 ,  9, NA, NA],\n",
    "    'mar_clicks' : [78, 79, NA, NA],\n",
    "})\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a08f1f4",
   "metadata": {},
   "source": [
    "**long → wide**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3ab98134",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame\n",
    "\n",
    "long_df = (\n",
    "    DataFrame({\n",
    "        'product': [*'BCDABCDAB'],\n",
    "        'month'  : [*'jjjffffmm'],\n",
    "        'sales'  : [*range(11, 11+9)],\n",
    "    })\n",
    "    .assign( # replace 'j' → 'jan', etc.\n",
    "        month=lambda d:\n",
    "            d['month'].map({'j': 'jan', 'f': 'feb', 'm': 'mar'})\n",
    "    )\n",
    ")\n",
    "\n",
    "# ① pivot\n",
    "# ② unstack"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7f0812d",
   "metadata": {},
   "source": [
    "### Working with multiple DataFrames\n",
    "\n",
    "- `pandas.merge` joins two DataFrames horizontally on a specific column (or optionally, the index).\n",
    "- `pandas.join` joins two DataFrames horizontally on their indexes.\n",
    "- `pandas.concat` joins many DataFrames either horizontally or vertically according to their row/column index.\n",
    "\n",
    "#### merge & join"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c52944f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame\n",
    "\n",
    "df_left = DataFrame({\n",
    "    'value': [ 1 ,  2 ,  3 ,  4            ],\n",
    "    'group': ['a', 'b', 'c', 'd'           ],\n",
    "})\n",
    "df_right = DataFrame({\n",
    "    'group': [          'c', 'd', 'e', 'f' ],\n",
    "    'value': [ -3,  -4,  -5,  -6 ],\n",
    "})\n",
    "\n",
    "# ① merge vs join \n",
    "# ② left, right, inner, outer, cross"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5514df9",
   "metadata": {},
   "source": [
    "#### Align → Work\n",
    "\n",
    "We have two signals: one is an event-based signal while the other is a polling\n",
    "signal. The difference here is that the event signal records the timestamp when\n",
    "an event occurred, whereas the polling signal gives us back the result at the\n",
    "time of check."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ca94c970",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame, Timedelta\n",
    "\n",
    "state_polling_df = DataFrame(\n",
    "    columns=[      'timestamp', 'device_id',  'state'],\n",
    "    data=[\n",
    "        ['2000-01-01 04:00:00',       'abc', 'state1'],\n",
    "        ['2000-01-01 04:30:00',       'abc', 'state1'],\n",
    "        ['2000-01-01 05:00:00',       'abc', 'state1'],\n",
    "        ['2000-01-01 05:30:00',       'abc', 'state3'],\n",
    "        ['2000-01-01 06:00:00',       'abc', 'state3'],\n",
    "        ['2000-01-01 06:30:00',       'abc', 'state2'],\n",
    "        ['2000-01-01 07:00:00',       'abc', 'state2'],\n",
    "        ['2000-01-01 07:30:00',       'abc', 'state1'],\n",
    "\n",
    "        ['2000-01-01 04:00:00',       'def', 'state2'],\n",
    "        ['2000-01-01 04:30:00',       'def', 'state1'],\n",
    "        ['2000-01-01 05:00:00',       'def', 'state3'],\n",
    "        ['2000-01-01 05:30:00',       'def', 'state3'],\n",
    "        ['2000-01-01 06:00:00',       'def', 'state1'],\n",
    "        ['2000-01-01 06:30:00',       'def', 'state1'],\n",
    "    ]\n",
    ").astype({'timestamp': 'datetime64[ns]'})\n",
    "\n",
    "alert_events = DataFrame(\n",
    "    columns=[      'timestamp', 'device_id'],\n",
    "    data=[\n",
    "        ['2000-01-01 03:15:00',       'abc'],\n",
    "        ['2000-01-01 04:05:00',       'abc'],\n",
    "        ['2000-01-01 04:17:00',       'abc'],\n",
    "        ['2000-01-01 04:44:00',       'abc'],\n",
    "        ['2000-01-01 05:10:00',       'abc'],\n",
    "        ['2000-01-01 05:23:00',       'abc'],\n",
    "        ['2000-01-01 05:43:00',       'abc'],\n",
    "        ['2000-01-01 05:53:00',       'abc'],\n",
    "        ['2000-01-01 06:02:00',       'abc'],\n",
    "        ['2000-01-01 06:08:00',       'abc'],\n",
    "        ['2000-01-01 06:10:00',       'abc'],\n",
    "        ['2000-01-01 06:23:00',       'abc'],\n",
    "        ['2000-01-01 06:51:00',       'abc'],\n",
    "\n",
    "        ['2000-01-01 03:05:00',       'def'],\n",
    "        ['2000-01-01 04:15:00',       'def'],\n",
    "        ['2000-01-01 04:27:00',       'def'],\n",
    "        ['2000-01-01 04:34:00',       'def'],\n",
    "        ['2000-01-01 05:20:00',       'def'],\n",
    "        ['2000-01-01 05:33:00',       'def'],\n",
    "        ['2000-01-01 06:22:00',       'def'],\n",
    "        ['2000-01-01 06:29:00',       'def'],\n",
    "        ['2000-01-01 06:43:00',       'def'],\n",
    "        ['2000-01-01 07:01:00',       'def'],\n",
    "    ]\n",
    ").astype({'timestamp': 'datetime64[ns]'})\n",
    "\n",
    "# How many alert events occurred within each device & state?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4fba6afc",
   "metadata": {},
   "source": [
    "#### concat"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7ace2234",
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy import arange\n",
    "import pandas as pd\n",
    "\n",
    "data = arange(0, 18).reshape(3, 3, 2)\n",
    "\n",
    "# take note where these indexes and columns overlap and where they do not\n",
    "dfs = {\n",
    "    'df1': pd.DataFrame(data[0], index=['a', 'b', 'c'          ], columns=['x', 'y'     ]),\n",
    "    'df2': pd.DataFrame(data[1], index=[     'b', 'c', 'd'     ], columns=['x',      'z']),\n",
    "    'df3': pd.DataFrame(data[2], index=[     'b',      'd', 'e'], columns=['x',      'z']),\n",
    "}\n",
    "\n",
    "# concat across axes\n",
    "# concat inner/outer"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4bfa89f0",
   "metadata": {},
   "source": [
    "##### pandas.concat Performance Considerations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7580dfe",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import Series, DataFrame, concat\n",
    "\n",
    "def append(self, other):\n",
    "\treturn concat([self, other])\n",
    "\n",
    "Series.append = append\n",
    "DataFrame.append = append\n",
    "\n",
    "s = Series([1, 2, 3])\n",
    "print(\n",
    "\ts.append(Series([4, 5, 6]))\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3ec96c7",
   "metadata": {},
   "source": [
    "- list → designed to grow in size\n",
    "- pandas.Series/numpy.array → static size, needs reallocation to grow\n",
    "- pandas.DataFrame → doesn’t want to change the number of rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f1d3b46",
   "metadata": {},
   "outputs": [],
   "source": [
    "from itertools import islice\n",
    "from _lib import timed\n",
    "\n",
    "from pandas import read_csv, DataFrame, concat\n",
    "from pathlib import Path\n",
    "from numpy.random import default_rng\n",
    "from tempfile import TemporaryDirectory\n",
    "\n",
    "def data_gen(rng):\n",
    "    while True:\n",
    "        yield DataFrame(\n",
    "            rng.normal(10, 3, size=(10_000, 5)), columns=[*'abcde']\n",
    "        )\n",
    "rng = default_rng(0)\n",
    "\n",
    "def append(self, other):\n",
    "\treturn concat([self, other])\n",
    "DataFrame.append = append\n",
    "\n",
    "# time repeated concat/append vs concat on list"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae46ef75",
   "metadata": {},
   "source": [
    "*Adding rows vs adding Columns*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2a55e518",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame\n",
    "from _lib import timed\n",
    "from warnings import catch_warnings, simplefilter\n",
    "\n",
    "df = DataFrame({'letters': [*'abc'], 'values': [1,2,3]})\n",
    "\n",
    "with timed('growing rows'):\n",
    "\tfor i in range(df.index.max(), df.index.max() + 200):\n",
    "\t\tdf.loc[i] = ['d', 4]\n",
    "\n",
    "df = DataFrame({'letters': [*'abc'], 'values': [1,2,3]})\n",
    "with (catch_warnings(), timed('growing columns')):\n",
    "\tsimplefilter('ignore')\n",
    "\tfor i in range(200):\n",
    "\t\tdf[i] = 1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27182e94",
   "metadata": {},
   "source": [
    "#### Under the Hood: BlockManager?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8dade1c3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame\n",
    "from _lib import timed\n",
    "from warnings import catch_warnings, simplefilter\n",
    "\n",
    "df = DataFrame({\n",
    "\t'letters': [*'abc'], 'ints': [1,2,3], 'floats': [1.0, 2.0, 3.0]\n",
    "})\n",
    "# df['another column'] = 1\n",
    "\n",
    "print(df._mgr)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46cd5c8a",
   "metadata": {},
   "source": [
    "Further Reading\n",
    "- [Wide & Long Data Formats in pandas](https://www.dontusethiscode.com/blog/2024-11-20_pandas_reshaping.html)\n",
    "- [The pandas reshaping functions you didn’t know about](https://www.dontusethiscode.com/blog/2024-12-04_pandas_reshaping2.html)\n",
    "- [pandas Equality Joins](https://www.dontusethiscode.com/blog/2024-04-17_joins.html)\n",
    "- [DataFrame Joins and Multisets](https://www.dontusethiscode.com/blog/2024-04-24_joins_equality_dup.html) \n",
    "- [pandas concat & index alignment](https://www.dontusethiscode.com/blog/2024-07-24_pandas-concat.html)\n",
    "- [DataFrame inequality & asof joins](https://www.dontusethiscode.com/blog/2024-07-10_ineq_joins.html)\n",
    "\n",
    "# Concepts Review\n",
    "\n",
    "**Why use pandas**\n",
    "- Restricted Computation Domain\n",
    "- Index alignment happens EVERYWHERE.\n",
    "\n",
    "**Simple analysis**\n",
    "- Loading data from files & in-memory\n",
    "- indexing/slicing/filtering\n",
    "- simple data exploration\n",
    "    - aggregations\n",
    "- simple plots\n",
    "\n",
    "**Data Cleaning**\n",
    "- column name normalization\n",
    "- mixed types: casting to specific dtypes\n",
    "- missing data\n",
    "- duplicates\n",
    "- unnecessary columns/rows\n",
    "- entity normalization\n",
    "\n",
    "**Grouped Operations**\n",
    "- apply, agg, transform\n",
    "    - use specific verbs (agg/transform > apply)\n",
    "    - avoid 1 function call per group\n",
    "\n",
    "**Reshaping Data**\n",
    "- wide vs long\n",
    "- melt         → wide to long\n",
    "- pivot        → long to wide\n",
    "- pivot_table  → long to wide w/ agg\n",
    "- stack        → wide to long (via the index)\n",
    "- unstack      → long to wide (via the index)\n",
    "- lreshape     → generic wide to long\n",
    "- wide_to_long → patterned widt to long\n",
    "\n",
    "## Working with multiple DataFrames\n",
    "- join\n",
    "- join_asof\n",
    "- merge\n",
    "- concat\n",
    "- inequality join?\n",
    "\n",
    "## Standard Reshaping Functions\n",
    "\n",
    "- melt        → wide to long\n",
    "- pivot       → long to wide\n",
    "- pivot_table → long to wide w/ agg\n",
    "- stack       → wide to long\n",
    "- unstack     → long to wide\n",
    "\n",
    "## Convenience Reshaping Functions\n",
    "- lreshape\n",
    "- wide_to_long\n",
    "\n",
    "## Working with multiple DataFrames\n",
    "- merge\n",
    "- join\n",
    "- join_asof\n",
    "- concat\n",
    "- inequality join?\n",
    "\n",
    "# Hard(er) Problems"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd5a4f9a",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Let's Get Started!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7421a45a",
   "metadata": {},
   "source": [
    "## Scrabble Anyone?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0382a54e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "from pandas import DataFrame, Series\n",
    "from string import ascii_lowercase\n",
    "\n",
    "words = ['hello', 'world', 'test', 'python', 'think']\n",
    "tiles = [*'pythnoik']\n",
    "\n",
    "# What words can I play?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32d02221",
   "metadata": {},
   "source": [
    "## Watch your transactions!\n",
    "\n",
    "You have your own credit card transactions from the past 3 years and want to\n",
    "analyze these data to investigate your own spending habits.\n",
    "\n",
    "The only caveat is that we have 3 different credit cards, and each creditor\n",
    "has a different structure for their transactional reports.\n",
    "\n",
    "### Data Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "635ede12",
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy.random import default_rng\n",
    "from pandas import date_range, DataFrame, Timestamp, to_timedelta\n",
    "\n",
    "category_map = {\n",
    "    \"Dining\": [\n",
    "      \"The Coffee Spot\", \"Quick Bites Grill\", \"JavaPoint Café\"\n",
    "    ],\n",
    "    \"Grocery\": [\n",
    "      \"Grocery Depot\", \"Green Earth Grocers\", \"CityMart\"\n",
    "    ],\n",
    "    \"Electronics\": [\"Tech Haven\", \"Smart Electronics\"],\n",
    "    \"Travel\": [\"Luxe Hotels\", \"TravelEase Agency\"],\n",
    "    \"Shopping\": [\n",
    "      \"Urban Outfitters\", \"Fashion Forward\", \"HighEnd Fashions\", \"Home Essentials\"\n",
    "    ],\n",
    "    \"Fitness\": [\n",
    "      \"Fitness World\", \"Mountain Gear\"\n",
    "    ],\n",
    "    \"Books\": [\"Global Books\"],\n",
    "    \"Gas\": [\"Speedy Gas\"]\n",
    "}\n",
    "\n",
    "average_transactions = {\n",
    "    \"The Coffee Spot\": 15,\n",
    "    \"Quick Bites Grill\": 20,\n",
    "    \"JavaPoint Café\": 12,\n",
    "    \"Grocery Depot\": 50,\n",
    "    \"Green Earth Grocers\": 45,\n",
    "    \"CityMart\": 60,\n",
    "    \"Tech Haven\": 500,\n",
    "    \"Smart Electronics\": 300,\n",
    "    \"Luxe Hotels\": 1200,\n",
    "    \"TravelEase Agency\": 800,\n",
    "    \"Urban Outfitters\": 100,\n",
    "    \"Fashion Forward\": 150,\n",
    "    \"HighEnd Fashions\": 200,\n",
    "    \"Home Essentials\": 80,\n",
    "    \"Fitness World\": 60,\n",
    "    \"Mountain Gear\": 120,\n",
    "    \"Global Books\": 25,\n",
    "    \"Speedy Gas\": 40\n",
    "}\n",
    "\n",
    "def gen_finix(rng):\n",
    "    transactions = []\n",
    "    for category, merchants in category_map.items():\n",
    "        for merchant in merchants:\n",
    "            avg_amount = average_transactions[merchant]\n",
    "            for _ in range(rng.integers(50, 100)):\n",
    "                transaction_date = Timestamp('2020-01-01') + to_timedelta(rng.integers(0, 365*3), unit='d')\n",
    "                transactions.append({\n",
    "                    \"store\": merchant,\n",
    "                    \"category\": category,\n",
    "                    \"date\": f\"{transaction_date:%m/%d/%Y}\",\n",
    "                    \"amount\": rng.normal(avg_amount, avg_amount * .2) * rng.choice([1, -1], p=[.99, .01]),\n",
    "                })\n",
    "    return DataFrame.from_records(transactions)\n",
    "\n",
    "def gen_credo(rng):\n",
    "    transactions = []\n",
    "    for category, merchants in category_map.items():\n",
    "        for merchant in merchants:\n",
    "            avg_amount = average_transactions[merchant]\n",
    "            for _ in range(rng.integers(50, 100)):\n",
    "                transaction_date = Timestamp('2020-01-01') + to_timedelta(rng.integers(0, 365*3), unit='d')\n",
    "                transaction_type = rng.choice(['purchase', 'refund'], p=[.98, .02])\n",
    "                transactions.append({\n",
    "                    \"Merchant\": merchant.replace(' ', ''),\n",
    "                    \"Category\": category.lower(),\n",
    "                    \"transaction_Date\": f\"{transaction_date:%Y-%m-%d}\",\n",
    "                    \"posted_Date\": f\"{transaction_date + to_timedelta(rng.integers(0, 2), unit='d'):%Y-%m-%d}\",\n",
    "                    transaction_type: rng.normal(avg_amount, avg_amount * .2),\n",
    "                })\n",
    "    return DataFrame.from_records(transactions)\n",
    "\n",
    "def gen_axpy(rng):\n",
    "    transactions = []\n",
    "    for category, merchants in category_map.items():\n",
    "        for merchant in merchants:\n",
    "            avg_amount = average_transactions[merchant]\n",
    "            for _ in range(rng.integers(80, 200)):\n",
    "                transaction_date = Timestamp('2020-01-01') + to_timedelta(rng.integers(0, 365*3), unit='d')\n",
    "                transaction_type = rng.choice(['purchase', 'refund'], p=[.98, .02])\n",
    "                transactions.append({\n",
    "                    \"Merchant\": merchant.replace(' ', ''),\n",
    "                    \"Category\": category,\n",
    "                    \"Date\": f\"{transaction_date:%Y-%m-%d}\",\n",
    "                    \"Type\": transaction_type,\n",
    "                    \"Amount\": rng.normal(avg_amount, avg_amount * .2),\n",
    "                })\n",
    "    return DataFrame.from_records(transactions)\n",
    "\n",
    "\n",
    "rng = default_rng(0)\n",
    "frames = {\n",
    "    'finix': gen_finix(rng),\n",
    "    'credo': gen_credo(rng),\n",
    "    'axpy':  gen_axpy(rng),\n",
    "}\n",
    "\n",
    "from pathlib import Path\n",
    "from pandas import to_datetime\n",
    "p = Path('data', 'transactions')\n",
    "p.mkdir(exist_ok=True)\n",
    "\n",
    "for name, df in frames.items():\n",
    "    if name == 'axpy':\n",
    "        dump_path = p / name\n",
    "        dump_path.mkdir(exist_ok=True)\n",
    "        for dt, group in df.groupby(df.pop('Date')):\n",
    "            group.to_csv(dump_path / f'{dt}.csv', index=False)\n",
    "    else:\n",
    "        df.to_csv(p / f'{name}.csv', index=False)\n",
    "    print(name, df.head(), sep='\\n', end='\\n{}\\n'.format('-' * 40))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f651f38b",
   "metadata": {},
   "source": [
    "Running the above code results in a file structure like…\n",
    "\n",
    "```\n",
    "transactions/\n",
    "├── credo.csv\n",
    "├── finix.csv\n",
    "└── axpy/\n",
    "    ├── 2020-01-01.csv\n",
    "    ├── 2020-01-02.csv\n",
    "    ├── 2020-01-03.csv\n",
    "    ├── 2020-01-04.csv\n",
    "    ├── 2020-01-05.csv\n",
    "    ...\n",
    "```\n",
    "\n",
    "We want to determine...\n",
    "1. How much did we receive in refunds? How much did we spend?\n",
    "2. What is our favorite merchant to visit?\n",
    "3. What is our highest spending category across all cards?\n",
    "4. How much did we spend each month?\n",
    "    - What is our most expensive month?\n",
    "    - What is the most expensive month, averaging across years?\n",
    "5. What day did we make the most transactions?\n",
    "    - What day did we visit the most unique merchants?\n",
    "6. Were there any days that we did not make a transaction?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bcd5ce1d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "1a116e08",
   "metadata": {},
   "source": [
    "## Star Trader\n",
    "\n",
    "### Data Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2555cc89",
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy.random import default_rng\n",
    "from pandas import DataFrame, MultiIndex, date_range, Series, to_timedelta, IndexSlice, CategoricalIndex\n",
    "from pathlib import Path\n",
    "from sys import exit\n",
    "from scipy.stats import skewnorm\n",
    "import sys; sys.breakpointhook = exit\n",
    "\n",
    "data_dir = Path('data')\n",
    "data_dir.mkdir(exist_ok=True, parents=True)\n",
    "\n",
    "full_dates = date_range('2020-01-01', periods=180*2, freq='D')\n",
    "dates = date_range('2020-01-01', periods=180, freq='D')\n",
    "\n",
    "assets = '''\n",
    "    Equipment Medicine Metals Software StarGems Uranium\n",
    "'''.split()\n",
    "assets = CategoricalIndex(assets)\n",
    "\n",
    "stars = '''\n",
    "    Sol\n",
    "    Boyd Fate Gaol Hook Ivan Kirk Kris Quin\n",
    "    Reef Sand Sink Stan Task York\n",
    "'''.split()\n",
    "stars = CategoricalIndex(stars)\n",
    "\n",
    "players = '''\n",
    "    Alice Bob Charlie Dana\n",
    "'''.split()\n",
    "players = CategoricalIndex(players)\n",
    "\n",
    "rng = default_rng(0)\n",
    "\n",
    "inventory = (\n",
    "    Series(\n",
    "        index=(idx :=\n",
    "            MultiIndex.from_product([\n",
    "                players,\n",
    "                assets,\n",
    "            ], names='player asset'.split())\n",
    "        ),\n",
    "        data=rng.normal(loc=1, scale=.25, size=len(idx)),\n",
    "        name='volume',\n",
    "    ) * Series({\n",
    "        'Equipment': 1_000,\n",
    "        'Medicine':    500,\n",
    "        'Metals':    1_250,\n",
    "        'Software':    350,\n",
    "        'StarGems':      5,\n",
    "        'Uranium':      50,\n",
    "    }, name='volume').rename_axis(index='asset')\n",
    ").pipe(lambda s:\n",
    "    s\n",
    "        .sample(len(s) - 3, random_state=rng)\n",
    "        .sort_index()\n",
    ").pipe(lambda s:\n",
    "    s\n",
    "        .astype('int')\n",
    "        .reindex(idx)\n",
    "        .astype('Int64')\n",
    ")\n",
    "\n",
    "base_prices = Series({\n",
    "    'Equipment':    7,\n",
    "    'Medicine':    40,\n",
    "    'Metals':       3,\n",
    "    'Software':    20,\n",
    "    'StarGems': 1_000,\n",
    "    'Uranium':    500,\n",
    "}, name='price').rename_axis('asset')\n",
    "\n",
    "price_shifts = (\n",
    "    Series(\n",
    "        index=(idx :=\n",
    "            MultiIndex.from_product([\n",
    "                full_dates,\n",
    "                stars,\n",
    "                assets,\n",
    "            ], names='date star asset'.split())\n",
    "        ),\n",
    "        data=(\n",
    "            rng.normal(loc=1, scale=0.05, size=(len(stars), len(assets))).clip(0, 1.5)\n",
    "            *\n",
    "            rng.normal(loc=1, scale=0.02, size=(len(full_dates), len(stars), len(assets))).clip(0, 1.5).cumprod(axis=0)\n",
    "        ).ravel(),\n",
    "        name='price',\n",
    "    )\n",
    ")\n",
    "spreads = (\n",
    "    Series(\n",
    "        index=(idx :=\n",
    "            MultiIndex.from_product([\n",
    "                full_dates,\n",
    "                stars,\n",
    "                assets,\n",
    "            ], names='date star asset'.split())\n",
    "        ),\n",
    "        data=skewnorm(a=1, loc=.02, scale=.01).rvs(len(idx), random_state=rng).clip(-0.01, +.05),\n",
    "        name='price',\n",
    "    )\n",
    ")\n",
    "\n",
    "market = DataFrame({\n",
    "    'buy':  base_prices * price_shifts * (1 + spreads),\n",
    "    'sell': base_prices * price_shifts,\n",
    "}).rename_axis(columns='direction').pipe(\n",
    "    lambda df: df.set_axis(\n",
    "        df.columns.astype('category'),\n",
    "        axis='columns',\n",
    "    )\n",
    ")\n",
    "\n",
    "loc_ps = {\n",
    "    pl: (p := rng.integers(10, size=len(stars))) / p.sum()\n",
    "    for pl in players\n",
    "}\n",
    "locations = (\n",
    "    DataFrame(\n",
    "        index=(idx := dates),\n",
    "        data={\n",
    "            pl: rng.choice(stars, p=loc_ps[pl], size=len(idx))\n",
    "            for pl in players\n",
    "        },\n",
    "    )\n",
    "    .rename_axis(index='date', columns='player')\n",
    "    .pipe(lambda s:\n",
    "        s\n",
    "        .set_axis(\n",
    "            s.columns.astype(players.dtype),\n",
    "            axis='columns',\n",
    "        )\n",
    "        .astype(\n",
    "              stars.dtype,\n",
    "        )\n",
    "    )\n",
    "    .stack('player')\n",
    "    .rename('star')\n",
    "    .pipe(\n",
    "        lambda s: s\n",
    "            .sample(frac=.75, random_state=rng)\n",
    "            .reindex(s.index)\n",
    "            .groupby('player', observed=False).ffill()\n",
    "            .groupby('player', observed=False).bfill()\n",
    "            .sort_index()\n",
    "    )\n",
    ")\n",
    "\n",
    "trips = (\n",
    "    locations.groupby('player', group_keys=False, observed=False).apply(\n",
    "        lambda g: g[g != g.shift()]\n",
    "    ).sort_index()\n",
    ")\n",
    "\n",
    "standard_volumes = (10_000 / base_prices).round(0)\n",
    "\n",
    "trades = (\n",
    "    DataFrame(\n",
    "        index=(idx :=\n",
    "            MultiIndex.from_product([\n",
    "                dates,\n",
    "                players,\n",
    "                assets,\n",
    "                range(10),\n",
    "            ], names='date player asset trade#'.split())\n",
    "        ),\n",
    "        data={\n",
    "            'sentiment': rng.normal(loc=0, scale=.025, size=len(idx)),\n",
    "            'regret': rng.normal(loc=0, scale=.0005, size=len(idx)),\n",
    "            'edge': rng.normal(loc=1, scale=.001, size=len(idx)).clip(.75, 1.25),\n",
    "        },\n",
    "    )\n",
    "    .pipe(\n",
    "        lambda df: df\n",
    "            .assign(\n",
    "                buy=lambda df: (df.groupby(['player', 'asset'], observed=False)['sentiment'].rolling(3).mean() > 0).values,\n",
    "                sign=lambda df: df['buy'] * -1 + ~df['buy'],\n",
    "                direction=lambda df: df['buy'].map({True: 'buy', False: 'sell'}).astype(market.columns.dtype),\n",
    "                volume=lambda df: df['sign'] * rng.normal(loc=1, scale=.5, size=len(df)).clip(0, 2),\n",
    "            )\n",
    "            .assign(\n",
    "                star=lambda df:\n",
    "                    locations.loc[\n",
    "                        MultiIndex.from_arrays([\n",
    "                            df.index.get_level_values('date'),\n",
    "                            df.index.get_level_values('player'),\n",
    "                        ])\n",
    "                    ].values,\n",
    "                asset_price=lambda df: (\n",
    "                    market.stack('direction').loc[\n",
    "                        MultiIndex.from_arrays([\n",
    "                            df.index.get_level_values('date'),\n",
    "                            df['star'],\n",
    "                            df.index.get_level_values('asset'),\n",
    "                            df['direction'],\n",
    "                        ])\n",
    "                    ].values\n",
    "                ),\n",
    "                price=lambda df: (df['asset_price'] * df['edge']).round(4),\n",
    "                mark=lambda df: (df['price'] * (1 + df['regret'])).round(4),\n",
    "                volume=lambda df: (df['volume'] * standard_volumes).round(-1).astype(int),\n",
    "            )\n",
    "    )\n",
    "    .pipe(\n",
    "        lambda df: df\n",
    "            .loc[lambda df: df['volume'] != 0]\n",
    "            .sample(frac=.5, random_state=rng)\n",
    "            .sort_index()\n",
    "    )\n",
    ")\n",
    "\n",
    "data_dir = Path('data', 'startrader')\n",
    "data_dir.mkdir(exist_ok=True, parents=True)\n",
    "\n",
    "marks = trades['mark'].groupby(['date', 'player', 'asset'], observed=False).last()\n",
    "\n",
    "market.to_pickle(data_dir / 'market.pkl')\n",
    "trips.to_pickle(data_dir / 'trips.pkl')\n",
    "inventory.to_pickle(data_dir / 'inventory.pkl')\n",
    "locations.to_pickle(data_dir / 'locations.pkl')\n",
    "trades.loc[:, ['volume', 'star', 'price']].assign(\n",
    "    price=lambda df: df['price'].round(2),\n",
    ").to_pickle(data_dir / 'trades.pkl')\n",
    "\n",
    "print('done!')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2e6d2f3",
   "metadata": {},
   "source": [
    "### Inventory only"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4e7a5d88",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import read_pickle\n",
    "inventory = read_pickle('data/startrader/inventory.pkl')\n",
    "\n",
    "# Who had the most Uranium? What about second most?\n",
    "# The most of each asset?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c788ffba",
   "metadata": {},
   "source": [
    "### Inventory & Market"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4375d046",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import read_pickle\n",
    "inventory = read_pickle('data/startrader/inventory.pkl')\n",
    "market = read_pickle('data/startrader/market.pkl')\n",
    "\n",
    "print(inventory.head(), market.head(), sep='\\n' * 2)\n",
    "\n",
    "# what is everyone’s inventory worth on Sol?\n",
    "# what is everyone’s inventory worth on everywhere?\n",
    "# where is everyones inventory worth the most?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d63c9b03",
   "metadata": {},
   "source": [
    "### Inventory & Trades"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c85fb25a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import read_pickle\n",
    "inventory = read_pickle('data/startrader/inventory.pkl')\n",
    "trades = read_pickle('data/startrader/trades.pkl')\n",
    "\n",
    "# what is maximum number of Star Gems each player had, and when were those maximums observed?\n",
    "# did Alice every have more Star Gems than Bob? If so, when?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e68a853e",
   "metadata": {},
   "source": [
    "### Trips"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2aaefa6",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "outputs": [],
   "source": [
    "from pandas import read_pickle\n",
    "from pandas import Grouper\n",
    "trips = read_pickle('data/startrader/trips.pkl')\n",
    "\n",
    "# Where was Alice at on 2024-01-02?\n",
    "print(\n",
    "    trips.head(8),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c885eb83",
   "metadata": {},
   "source": [
    "\n"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "-all",
   "main_language": "python",
   "notebook_metadata_filter": "-all"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
 }