jirislav · May 16, 2022 07:03
diff --git a/apache-spark-in-practice-it-academy-2022.ipynb b/apache-spark-in-practice-it-academy-2022.ipynb
 {
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Apache Spark in practice - IT Academy 2022.ipynb",
      "provenance": [],
      "toc_visible": true,
      "authorship_tag": "ABX9TyPfHLdBytG5kyszJYAfkiZq",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/jirislav/173dc2ba48959753a124b11f1719bafa/apache-spark-in-practice-it-academy-2022.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# What you'll put your hands on in this notebook\n",
        "\n",
        "- reading data from the Parquet format\n",
        "- linking two separate data sets together based on a common field\n",
        "- writing simple aggregation"
      ],
      "metadata": {
        "id": "eVmi7UboZsSf"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Preparing the data & Spark session"
      ],
      "metadata": {
        "id": "Sarump79aV6L"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Following snippet will download our example dataset which you'll be working with:"
      ],
      "metadata": {
        "id": "m6PfAenLaZnJ"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 25,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "zhmtPsYHX7xX",
        "outputId": "c1dfcf45-9c2b-4e51-9cce-bc4cbc889f87"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "--2022-05-16 07:00:41--  https://github.com/seznam/IT-akademie-bigdata/raw/main/big-data/data/example-dataset.tar.xz\n",
            "Resolving github.com (github.com)... 140.82.121.3\n",
            "Connecting to github.com (github.com)|140.82.121.3|:443... connected.\n",
            "HTTP request sent, awaiting response... 302 Found\n",
            "Location: https://raw.githubusercontent.com/seznam/IT-akademie-bigdata/main/big-data/data/example-dataset.tar.xz [following]\n",
            "--2022-05-16 07:00:42--  https://raw.githubusercontent.com/seznam/IT-akademie-bigdata/main/big-data/data/example-dataset.tar.xz\n",
            "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...\n",
            "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 7190296 (6.9M) [application/octet-stream]\n",
            "Saving to: ‘example-dataset.tar.xz’\n",
            "\n",
            "example-dataset.tar 100%[===================>]   6.86M  --.-KB/s    in 0.04s   \n",
            "\n",
            "2022-05-16 07:00:42 (181 MB/s) - ‘example-dataset.tar.xz’ saved [7190296/7190296]\n",
            "\n",
            "total 273284\n",
            "drwxrwxr-x  4 1000 1000      4096 May 16 05:40 example-dataset\n",
            "-rw-r--r--  1 root root   7190296 May 16 07:00 example-dataset.tar.xz\n",
            "drwxr-xr-x  1 root root      4096 May  3 13:42 sample_data\n",
            "drwxr-xr-x 13  501 1000      4096 Jan 20 20:52 spark-3.2.1-bin-hadoop2.7\n",
            "-rw-r--r--  1 root root 272637746 Jan 20 21:37 spark-3.2.1-bin-hadoop2.7.tgz\n"
          ]
        }
      ],
      "source": [
        "!test -f example-dataset.tar.xz || wget https://github.com/seznam/IT-akademie-bigdata/raw/main/big-data/data/example-dataset.tar.xz\n",
        "!test -d example-dataset || tar -xf example-dataset.tar.xz\n",
        "!ls -l"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now let's install Spark on PySpark:"
      ],
      "metadata": {
        "id": "G8XblEIbafX2"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Install Spark\n",
        "\n",
        "import os\n",
        "os.chdir(\"/content\")\n",
        "!test -f spark-3.2.1-bin-hadoop2.7.tgz || wget https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop2.7.tgz\n",
        "!test -d spark-3.2.1-bin-hadoop2.7 || tar -xf spark-3.2.1-bin-hadoop2.7.tgz\n",
        "\n",
        "# Setup pyspark\n",
        "!pip install findspark\n",
        "import findspark\n",
        "os.environ[\"SPARK_HOME\"] = \"/content/spark-3.2.1-bin-hadoop2.7\"\n",
        "findspark.init()\n",
        "\n",
        "# Create new SparkSession\n",
        "from pyspark.sql import SparkSession\n",
        "spark = SparkSession.builder \\\n",
        "        .master(\"local[*]\") \\\n",
        "        .getOrCreate()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "zzn3RQlrauVY",
        "outputId": "8df5f747-c0b5-45cd-ac9a-e9a8e7b3caba"
      },
      "execution_count": 10,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "--2022-05-16 06:12:41--  https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop2.7.tgz\n",
            "Resolving archive.apache.org (archive.apache.org)... 138.201.131.134, 2a01:4f8:172:2ec5::2\n",
            "Connecting to archive.apache.org (archive.apache.org)|138.201.131.134|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 272637746 (260M) [application/x-gzip]\n",
            "Saving to: ‘spark-3.2.1-bin-hadoop2.7.tgz’\n",
            "\n",
            "spark-3.2.1-bin-had 100%[===================>] 260.01M  7.06MB/s    in 47s     \n",
            "\n",
            "2022-05-16 06:13:28 (5.48 MB/s) - ‘spark-3.2.1-bin-hadoop2.7.tgz’ saved [272637746/272637746]\n",
            "\n",
            "Collecting findspark\n",
            "  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)\n",
            "Installing collected packages: findspark\n",
            "Successfully installed findspark-2.0.1\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Just some handy functions to keep the code cells clean later on\n",
        "# Note that we don't use asterisk (*) because then the Colab completion doesn't work\n",
        "from pyspark.sql.functions import col, floor, udf, explode"
      ],
      "metadata": {
        "id": "bL8Ua8jyaztf"
      },
      "execution_count": 16,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Reading the data\n",
        "\n",
        "Now it's time for you to come up with what you've learned from the previous [Introduction to Apache Spark - IT Academy 2022](https://colab.research.google.com/github/seznam/IT-akademie-bigdata/blob/main/big-data/notebooks/001_introduction_to_apache_spark.ipynb) notebook.\n",
        "\n",
        "Let us help you for starters by looking into the structure of the data we have prepared for you."
      ],
      "metadata": {
        "id": "b1CBaSA6a-Jg"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!ls -l example-dataset/"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "bIk9BOmDdRMq",
        "outputId": "25f036d1-ee1a-41ad-b755-e86ead26022d"
      },
      "execution_count": 7,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "total 8\n",
            "drwxrwxr-x 2 1000 1000 4096 May  3 16:02 clicks\n",
            "drwxrwxr-x 2 1000 1000 4096 May  3 16:04 impressions\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Okay, we have two directories `clicks` and `impressions`. There were taken from Seznam's ad division, so a click means some user clicked an ad, while an impression means an ad was loaded by the browser and rendered.\n",
        "\n",
        "Let's look into the directories to see the data format:"
      ],
      "metadata": {
        "id": "GyQZylO9dZ6Q"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!ls -l example-dataset/*"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Xbl3haL6fI5e",
        "outputId": "31fad85b-9bd9-40e4-920a-459d8010b739"
      },
      "execution_count": 8,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "example-dataset/clicks:\n",
            "total 5376\n",
            "-rw-rw-r-- 1 1000 1000 4065411 May  3 16:03 part-00000-fb446a88-0486-413b-87b9-e9c9f930d3ab-c000.snappy.parquet\n",
            "-rw-rw-r-- 1 1000 1000 1435727 May  3 16:03 part-00001-fb446a88-0486-413b-87b9-e9c9f930d3ab-c000.snappy.parquet\n",
            "-rw-rw-r-- 1 1000 1000       0 May  3 16:03 _SUCCESS\n",
            "\n",
            "example-dataset/impressions:\n",
            "total 4416\n",
            "-rw-rw-r-- 1 1000 1000 1130364 May  3 16:04 part-00000-456bf299-6efb-45f1-905e-2512ed65d1fe-c000.snappy.parquet\n",
            "-rw-rw-r-- 1 1000 1000 1126469 May  3 16:04 part-00001-456bf299-6efb-45f1-905e-2512ed65d1fe-c000.snappy.parquet\n",
            "-rw-rw-r-- 1 1000 1000 1126712 May  3 16:04 part-00002-456bf299-6efb-45f1-905e-2512ed65d1fe-c000.snappy.parquet\n",
            "-rw-rw-r-- 1 1000 1000 1128897 May  3 16:04 part-00003-456bf299-6efb-45f1-905e-2512ed65d1fe-c000.snappy.parquet\n",
            "-rw-rw-r-- 1 1000 1000       0 May  3 16:04 _SUCCESS\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "As you can see, the data are using Parquet format. And here comes your first task:\n",
        "- load clicks parquet directory into one DataFrame (named `clicks`)\n",
        "- and impressions directory into other DataFrame (named `impressions`)\n",
        "\n",
        "*HINT: [PySpark Documentation](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameReader.parquet.html?highlight=parquet) might be handy if autocompletion does not suffice*"
      ],
      "metadata": {
        "id": "jwNmhIGRfXz7"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "clicks = "
      ],
      "metadata": {
        "id": "gvDBhvbSftw0"
      },
      "execution_count": 12,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "impressions = "
      ],
      "metadata": {
        "id": "M_gUvrgCgLTl"
      },
      "execution_count": 13,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Let's see the schema and the data to verify you have loaded it successfully"
      ],
      "metadata": {
        "id": "0dU8s1cggPVl"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "clicks.printSchema()\n",
        "clicks.show()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "V06r_CJzgVoE",
        "outputId": "9f8e5a58-199d-46d5-a444-e3291f944021"
      },
      "execution_count": 14,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "root\n",
            " |-- Timestamp: long (nullable = true)\n",
            " |-- ImpressionTimestamp: long (nullable = true)\n",
            " |-- AdId: integer (nullable = true)\n",
            " |-- KeywordId: long (nullable = true)\n",
            " |-- RandomId: long (nullable = true)\n",
            "\n",
            "+----------------+-------------------+---------+----------+----------+\n",
            "|       Timestamp|ImpressionTimestamp|     AdId| KeywordId|  RandomId|\n",
            "+----------------+-------------------+---------+----------+----------+\n",
            "|1649307603675276|   1649307595649634|435947651|      null|2062901616|\n",
            "|1649307603734501|   1649307585109683|554794504|      null|3257733838|\n",
            "|1649307604043892|   1649307589191166|575725918|      null|2184303841|\n",
            "|1649307604149413|   1649307559129589|575725918|      null|1839752057|\n",
            "|1649307606070591|   1649306613659472|571896132|      null|3358643236|\n",
            "|1649307607675483|   1649307604482729|     null|      null|1556358807|\n",
            "|1649307608554653|   1649307563750753|575351159|      null|3988672169|\n",
            "|1649307609072961|   1649307559897859|573836983|      null|1161004805|\n",
            "|1649307609507554|   1649307558317625|575351159|      null|1918168565|\n",
            "|1649307610471479|   1649307571801352|576071622|      null|3903568246|\n",
            "|1649307614983078|   1649307482608006|573442045|      null| 954869706|\n",
            "|1649307615180341|   1649307582711864|562940244|2562368594|1526164816|\n",
            "|1649307616337862|   1649307612356135|571327565|      null| 669564168|\n",
            "|1649307617820968|   1649307603704971|481783083|2322533803| 473825180|\n",
            "|1649307618353567|   1649307614632151|574151611|      null|1424195132|\n",
            "|1649307620639791|   1649307617169654|218269207|2292385741| 332260567|\n",
            "|1649307620928289|   1649307542460830|575549769|      null|4031402155|\n",
            "|1649307621861192|   1649307538545537|562668942|      null|2468609898|\n",
            "|1649307621938796|   1649307614032563|561767054|      null|2249586041|\n",
            "|1649307622202056|   1649307455133216|529127966|      null|2333302037|\n",
            "+----------------+-------------------+---------+----------+----------+\n",
            "only showing top 20 rows\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "impressions.printSchema()\n",
        "impressions.show()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "vZ4YQhIigZKR",
        "outputId": "79525931-76a4-4599-f64b-a859892b72ad"
      },
      "execution_count": 15,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "root\n",
            " |-- bidTimestamp: long (nullable = true)\n",
            " |-- timestamp: long (nullable = true)\n",
            " |-- requestId: string (nullable = true)\n",
            " |-- randomIds: array (nullable = true)\n",
            " |    |-- element: long (containsNull = true)\n",
            "\n",
            "+----------------+----------------+--------------------+--------------------+\n",
            "|    bidTimestamp|       timestamp|           requestId|           randomIds|\n",
            "+----------------+----------------+--------------------+--------------------+\n",
            "|1649282455048002|1649282455048002|347f3b79-57d9-4b3...|        [3783526641]|\n",
            "|1649282808972936|1649282808972936|3de24cdf-d2a6-4d1...|[1352518872, 1036...|\n",
            "|1649282887994142|1649282887994142|dfbb298c-239c-4ae...|         [990130250]|\n",
            "|1649282987862427|1649282987862427|3e518073-1612-4b5...|        [4055324047]|\n",
            "|1649283009547800|1649283009547800|8b3bc895-03a1-4f7...|        [1267857419]|\n",
            "|1649283101173985|1649283101173985|8ccfa691-c760-47e...|        [1544212150]|\n",
            "|1649283335899174|1649283335899174|bfa1a0e8-bb15-465...|        [2693591608]|\n",
            "|1649283409904003|1649283409904003|53da9ce0-d6dc-42b...|        [2031823830]|\n",
            "|1649283459230688|1649283459230688|98dedfb7-48fe-447...|         [990490641]|\n",
            "|1649283485210091|1649283485210091|a348f243-244f-46b...|         [418009298]|\n",
            "|1649283490049409|1649283490049409|995b0e66-9d98-420...|        [1404178780]|\n",
            "|1649283694328884|1649283694328884|b874cdc1-3aad-435...|        [1758249415]|\n",
            "|1649283732585322|1649283732585322|aa0fb53a-85e7-48b...|        [3245073081]|\n",
            "|1649283779706411|1649283779706411|ff0c30a3-6769-461...|         [313257395]|\n",
            "|1649283781046058|1649283781046058|1cc3ed93-e697-4c5...|[412534125, 23264...|\n",
            "|1649284053383566|1649284053383566|788ecd76-74cb-4e9...|         [341351989]|\n",
            "|1649284144662014|1649284144662014|e9ace95a-fb6c-410...|         [579766735]|\n",
            "|1649284173406769|1649284173406769|813daab0-597b-410...|        [2672214130]|\n",
            "|1649284193027211|1649284193027211|20e95fb9-7be3-4c6...|        [1586066628]|\n",
            "|1649284340340046|1649284340340046|e2facbc4-c122-4a5...|        [3254386147]|\n",
            "+----------------+----------------+--------------------+--------------------+\n",
            "only showing top 20 rows\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Great! At this point, you're ready to go on to the second part."
      ],
      "metadata": {
        "id": "hYf0m_wXgj6R"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Linking two datasets together"
      ],
      "metadata": {
        "id": "KayA17Xngpsp"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "As you can see from the schemas of both DataFrames, there are some fields, which we can use to link the data together.\n",
        "\n",
        "One of them is `click.ImpressionTimestamp` column, which should correlate to `impression.timestamp` column. We will use this to link the two datasets together.\n",
        "\n",
        "The second one is a bit tricky to understand, because first you need to understand, what is an impression within our dataset.\n",
        "- single line of `impressions` DataFrame represents a single rendering result from the browser, but it potentially includes more ads at once, which is why there is an column named `randomIds`, which is actually an array of integers.\n",
        "- each `randomId` can be considered as a single ad\n",
        "- since one line of `clicks` DataFrame represents a single click on a single ad, we can link `impressions` to `clicks` only after we *explode* our `randomIds` array\n",
        "  - what do we mean by *exploding* the array?\n",
        "  - well, it's like flattening the structure, so that we get \"more lines\" in the DataFrame at the end, so that we have one impressions with exactly one randomId\n",
        "\n",
        "That said, we now know we can also link `click.RandomId` field with exploded `impression.randomIds[]` array."
      ],
      "metadata": {
        "id": "ZXEjR06LgvXC"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "See function docs you should use:\n",
        "- https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.withColumn.html\n",
        "- https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.explode.html"
      ],
      "metadata": {
        "id": "YEeTa8a8kjYm"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "exploded_impressions = "
      ],
      "metadata": {
        "id": "M-QDlRT2jk5e"
      },
      "execution_count": 17,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Before we proceed to joining two data frames, we need to unify correlated column names\n",
        "exploded_impressions = exploded_impressions \\\n",
        "                        .drop('randomIds') \\\n",
        "                        .withColumnRenamed('timestamp', 'impressionTimestamp')"
      ],
      "metadata": {
        "id": "zx4ODX4pk7zY"
      },
      "execution_count": 18,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Let's see what we've got:"
      ],
      "metadata": {
        "id": "LyJMqvf-l70E"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "exploded_impressions.printSchema()\n",
        "exploded_impressions.show()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "mvsNXh0al20D",
        "outputId": "32e9647c-7d36-4887-e356-8c1f82c3b633"
      },
      "execution_count": 21,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "root\n",
            " |-- bidTimestamp: long (nullable = true)\n",
            " |-- impressionTimestamp: long (nullable = true)\n",
            " |-- requestId: string (nullable = true)\n",
            " |-- randomId: long (nullable = true)\n",
            "\n",
            "+----------------+-------------------+--------------------+----------+\n",
            "|    bidTimestamp|impressionTimestamp|           requestId|  randomId|\n",
            "+----------------+-------------------+--------------------+----------+\n",
            "|1649282455048002|   1649282455048002|347f3b79-57d9-4b3...|3783526641|\n",
            "|1649282808972936|   1649282808972936|3de24cdf-d2a6-4d1...|1352518872|\n",
            "|1649282808972936|   1649282808972936|3de24cdf-d2a6-4d1...|1036945266|\n",
            "|1649282887994142|   1649282887994142|dfbb298c-239c-4ae...| 990130250|\n",
            "|1649282987862427|   1649282987862427|3e518073-1612-4b5...|4055324047|\n",
            "|1649283009547800|   1649283009547800|8b3bc895-03a1-4f7...|1267857419|\n",
            "|1649283101173985|   1649283101173985|8ccfa691-c760-47e...|1544212150|\n",
            "|1649283335899174|   1649283335899174|bfa1a0e8-bb15-465...|2693591608|\n",
            "|1649283409904003|   1649283409904003|53da9ce0-d6dc-42b...|2031823830|\n",
            "|1649283459230688|   1649283459230688|98dedfb7-48fe-447...| 990490641|\n",
            "|1649283485210091|   1649283485210091|a348f243-244f-46b...| 418009298|\n",
            "|1649283490049409|   1649283490049409|995b0e66-9d98-420...|1404178780|\n",
            "|1649283694328884|   1649283694328884|b874cdc1-3aad-435...|1758249415|\n",
            "|1649283732585322|   1649283732585322|aa0fb53a-85e7-48b...|3245073081|\n",
            "|1649283779706411|   1649283779706411|ff0c30a3-6769-461...| 313257395|\n",
            "|1649283781046058|   1649283781046058|1cc3ed93-e697-4c5...| 412534125|\n",
            "|1649283781046058|   1649283781046058|1cc3ed93-e697-4c5...|2326499039|\n",
            "|1649284053383566|   1649284053383566|788ecd76-74cb-4e9...| 341351989|\n",
            "|1649284144662014|   1649284144662014|e9ace95a-fb6c-410...| 579766735|\n",
            "|1649284173406769|   1649284173406769|813daab0-597b-410...|2672214130|\n",
            "+----------------+-------------------+--------------------+----------+\n",
            "only showing top 20 rows\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now that we've exploded impressions using `randomIds`, it's time for you to join these DataFrames by `randomId` and  `impressionTimestamp` fields. Please consult the docs:\n",
        "- https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.join.html"
      ],
      "metadata": {
        "id": "s8vLGkpeldud"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "linked_impressions = "
      ],
      "metadata": {
        "id": "jUEtcn_KkRlg"
      },
      "execution_count": 19,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# To avoid ambiguity, rename Timestamp column from clicks\n",
        "linked_impressions = linked_impressions \\\n",
        "                      .withColumnRenamed('Timestamp', 'ClickTimestamp') "
      ],
      "metadata": {
        "id": "TFr90UdgkysW"
      },
      "execution_count": 20,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Let's see if we got it right:"
      ],
      "metadata": {
        "id": "RBvCwC0ol0ls"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "linked_impressions.printSchema()\n",
        "linked_impressions.show()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "WaH4w5J9mLY-",
        "outputId": "bcadbe31-3f71-4992-fe41-1ebc3cdb47c9"
      },
      "execution_count": 22,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "root\n",
            " |-- ImpressionTimestamp: long (nullable = true)\n",
            " |-- RandomId: long (nullable = true)\n",
            " |-- ClickTimestamp: long (nullable = true)\n",
            " |-- AdId: integer (nullable = true)\n",
            " |-- KeywordId: long (nullable = true)\n",
            " |-- bidTimestamp: long (nullable = true)\n",
            " |-- requestId: string (nullable = true)\n",
            "\n",
            "+-------------------+----------+----------------+---------+---------+----------------+--------------------+\n",
            "|ImpressionTimestamp|  RandomId|  ClickTimestamp|     AdId|KeywordId|    bidTimestamp|           requestId|\n",
            "+-------------------+----------+----------------+---------+---------+----------------+--------------------+\n",
            "|   1649308746783112|2697130188|1649308755993175|575498235|     null|1649308746783112|1ec695a2-2c1e-4eb...|\n",
            "|   1649308025299273|3349023642|1649308043306380|575424121|     null|1649308025299273|5383ca69-0a90-4d9...|\n",
            "|   1649308630214863|3010845523|1649308646358244|575424120|     null|1649308630214863|4ebbab4b-c4a8-49b...|\n",
            "|   1649308629834350|3704989181|1649308658448623|575424121|     null|1649308629834350|539056cc-2b8a-4e0...|\n",
            "|   1649308028320631| 654514729|1649308069925438|575424121|     null|1649308028320631|5c05db8f-055b-433...|\n",
            "|   1649308424514310|1257658759|1649308431493396|575498235|     null|1649308424514310|74a56292-ab96-4f2...|\n",
            "|   1649308156412664| 218319585|1649308171180720|575408874|     null|1649308156412664|67ed6b01-0a3c-4f3...|\n",
            "|   1649309270734173|3520413478|1649309275471048|575498235|     null|1649309270734173|83af483f-4c75-455...|\n",
            "|   1649309345980143|3583421327|1649309371805958|575424119|     null|1649309345980143|6a5b53e6-430e-4a4...|\n",
            "|   1649309238153217|3664227272|1649309272423269|575408876|     null|1649309238153217|717bcece-4383-4a6...|\n",
            "|   1649309120675643|4285140971|1649309167201623|575498238|     null|1649309120675643|cf0f7d7e-2a73-45b...|\n",
            "|   1649308752939690| 393374342|1649308778398707|575424119|     null|1649308752939690|4a012d6f-8aa6-4e9...|\n",
            "|   1649309032630914| 643113148|1649309053128202|575408874|     null|1649309032630914|d91761b7-7e55-438...|\n",
            "|   1649307694763005| 758707646|1649307720603914|575408874|     null|1649307694763005|058ee6a2-52f3-458...|\n",
            "|   1649308491429879|1516546242|1649308492562977|575424121|     null|1649308491429879|fc01261b-7468-41e...|\n",
            "|   1649307709003817|2680720790|1649307749589233|575408876|     null|1649307709003817|23cc9b72-d283-4b5...|\n",
            "|   1649308209978505|  55613846|1649308284034776|575354385|     null|1649308209978505|a1247a6c-6092-432...|\n",
            "|   1649309028727080|2804267224|1649309069119961|575498238|     null|1649309028727080|bce86cb9-8af5-44e...|\n",
            "|   1649307557445721|1799418775|1649307563910797|575498235|     null|1649307557445721|a1da6829-57ec-458...|\n",
            "|   1649306586500520|2382383933|1649306591427737|575498235|     null|1649306586500520|39b4f78a-33cd-435...|\n",
            "+-------------------+----------+----------------+---------+---------+----------------+--------------------+\n",
            "only showing top 20 rows\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Bonus task\n",
        "\n",
        "- create also unlinked DataFrame, which will contain all these impressions or clicks, which were not linked by the other DataFrame"
      ],
      "metadata": {
        "id": "gZ1SY_W8nH07"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Hint: using just \"DataFrame.join\" method is enough\n",
        "\n",
        "unlinked_impressions =\n",
        "\n",
        "unlinked_impressions.printSchema()\n",
        "unlinked_impressions.count()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "KFrZFKvznO-c",
        "outputId": "9eaac88f-a19d-40bb-bb7c-f2f617046635"
      },
      "execution_count": 24,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "root\n",
            " |-- randomId: long (nullable = true)\n",
            " |-- impressionTimestamp: long (nullable = true)\n",
            " |-- bidTimestamp: long (nullable = true)\n",
            " |-- requestId: string (nullable = true)\n",
            " |-- ClickTimestamp: long (nullable = true)\n",
            " |-- AdId: integer (nullable = true)\n",
            " |-- KeywordId: long (nullable = true)\n",
            " |-- bidTimestamp: long (nullable = true)\n",
            " |-- requestId: string (nullable = true)\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "81702"
            ]
          },
          "metadata": {},
          "execution_count": 24
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Performing basic aggregation"
      ],
      "metadata": {
        "id": "Gr3ZaFdunEsx"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Your task here is to create a summary statistics of how many clicks were registered per each `AdId`.\n",
        "\n",
        "For example:\n",
        "\n",
        "| AdId | Clicks |\n",
        "| - | - |\n",
        "| 1002 | 30 |\n",
        "| 586 | 2 |\n",
        "| ... | ... |"
      ],
      "metadata": {
        "id": "hp46e2kmokoa"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "*Hint: Use group by aggregation from the [Introduction to Apache Spark - IT Academy 2022](https://colab.research.google.com/github/seznam/IT-akademie-bigdata/blob/main/big-data/notebooks/001_introduction_to_apache_spark.ipynb) notebook.*"
      ],
      "metadata": {
        "id": "JjnzfNNopH0t"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "clicks_by_ad_id =\n",
        "\n",
        "clicks_by_ad_id.printSchema()\n",
        "clicks_by_ad_id.show()"
      ],
      "metadata": {
        "id": "Upq9NvNAqml2"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "\n",
        "And that's about it!"
      ],
      "metadata": {
        "id": "ll5YRa7dqDes"
      }
    }
  ]
 }