synthetic-data-generation-of-wikipedia-infoboxes.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"private_outputs": true,
"provenance": [],
"machine_shape": "hm",
"gpuType": "L4",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/CliffordAnderson/ff326d8e9d657acff82dda380dd3bc3a/synthetic-data-generation-of-wikipedia-infoboxes.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8M4Lq_Eej9ul"
},
"source": [
"# Wikipedia Infobox Generation and Model Fine-tuning\n",
"\n",
"This notebook demonstrates how to:\n",
"1. Load Wikipedia stub articles about women in religion\n",
"2. Use GPT-4o-mini to generate appropriate infoboxes for these stubs\n",
"3. Fine-tune a T5 model to learn this stub → infobox transformation\n",
"4. Publish the resulting dataset and model to Hugging Face Hub\n",
"\n",
"The goal is to create a model that can automatically generate Wikipedia infoboxes from article content."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ElX2kqJ0j9un"
},
"source": [
"## Step 1: Install Required Libraries\n",
"\n",
"We need:\n",
"- `datadreamer.dev`: Framework for LLM-powered data generation and model training\n",
"- `datasets`: Hugging Face library for handling datasets\n",
"- `OpenAI`: For accessing GPT models to generate synthetic infoboxes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "aTg8VJNx80Jm"
},
"outputs": [],
"source": [
"!pip3 install datadreamer.dev datasets==3.2.0 OpenAI"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JXi89Ohqj9uo"
},
"source": [
"## Step 2: Import Core Libraries\n",
"\n",
"Setting up the main components we'll use for data processing and LLM interactions."
]
},
{
"cell_type": "code",
"source": [
"from datadreamer import DataDreamer\n",
"from datadreamer.llms import OpenAI\n",
"from datadreamer.steps import ProcessWithPrompt, HFHubDataSource"
],
"metadata": {
"id": "ubobFqSkBbvk"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "cdUF9THEj9up"
},
"source": [
"## Step 3: Configure API Keys\n",
"\n",
"Using Colab's secure userdata to access API keys for:\n",
"- **OpenAI**: To generate infoboxes using GPT-4o-mini\n",
"- **Hugging Face**: To download datasets and upload results\n",
"- **Weights & Biases**: For experiment tracking during training"
]
},
{
"cell_type": "code",
"source": [
"import os\n",
"from google.colab import userdata\n",
"\n",
"os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')\n",
"os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')\n",
"os.environ['WANDB_API_KEY'] = userdata.get('WANDB_API_KEY')"
],
"metadata": {
"id": "QDojl9hS_pPU"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "5E5mea5Aj9up"
},
"source": [
"## Step 4: Initialize DataDreamer Session\n",
"\n",
"DataDreamer manages the entire pipeline and saves intermediate results to `./output/` for reproducibility."
]
},
{
"cell_type": "code",
"source": [
"dd = DataDreamer('./output/')\n",
"dd.start()"
],
"metadata": {
"id": "6lpqylkw_ui1"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "fhc8mYa9j9uq"
},
"source": [
"## Step 5: Load Source Dataset\n",
"\n",
"Loading a curated dataset of Wikipedia stub articles about women in religion. These are short, incomplete articles that would benefit from having infoboxes added."
]
},
{
"cell_type": "code",
"source": [
"wiki_stubs_dataset = HFHubDataSource(\n",
"    \"Get Women in Religion Stubs\",\n",
"    \"andersoncliffb/women-in-religion-stubs\",\n",
"    split=\"train\",\n",
").select_columns([\"Wiki_Content\"])"
],
"metadata": {
"id": "ZwT8N0ZCBt0b"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "HBLR2-onj9uq"
},
"source": [
"## Step 6: Configure the LLM\n",
"\n",
"Setting up GPT-4o-mini as our generation model. This is cost-effective for generating structured content like infoboxes."
]
},
{
"cell_type": "code",
"source": [
"gpt4 = OpenAI(\n",
"    model_name=\"gpt-4o-mini\",\n",
")"
],
"metadata": {
"id": "6dR3jalOByoN"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "3zxptzNwj9uq"
},
"source": [
"## Step 7: Generate Infoboxes from Stubs\n",
"\n",
"This is the core synthetic data generation step. For each Wikipedia stub:\n",
"1. Send it to GPT-4o-mini with instructions to create an appropriate infobox\n", | |
"2. Store both the original stub and generated infobox as training pairs\n", | |
"\n", | |
"This creates the input-output pairs we'll use to train our T5 model." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"stubs_and_infoboxes = ProcessWithPrompt(\n", | |
" \"Generate Infoboxes from Stubs\",\n", | |
" inputs={\"inputs\": wiki_stubs_dataset.output[\"Wiki_Content\"]},\n", | |
" args={\n", | |
" \"llm\": gpt4,\n", | |
" \"instruction\": (\n", | |
" \"Extract the infobox from the Wikipedia stub. If there is no infobox, generate an appropriate Wikipedia infobox for the stub.\"\n", | |
" \"Return only the infoxbox, nothing else.\"\n", | |
" ),\n", | |
" },\n", | |
" outputs={\"inputs\": \"stub\", \"generations\": \"infobox\"},\n", | |
").select_columns([\"stub\", \"infobox\"])\n" | |
],
"metadata": {
"id": "yj0Vi4Y6B2y-"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "YGF5lo9Vj9ur"
},
"source": [
"## Step 8: Publish Generated Dataset\n",
"\n",
"Uploading our synthetic dataset to Hugging Face Hub with:\n",
"- 90% for training\n",
"- 10% for validation\n",
"\n",
"This makes the dataset publicly available and creates the train/validation splits we need."
]
},
{
"cell_type": "code",
"source": [
"stubs_and_infoboxes.publish_to_hf_hub(\n",
"    \"andersoncliffb/women-religion-stubs-with-infoboxes\",\n",
"    train_size=0.90,\n",
"    validation_size=0.10,\n",
")"
],
"metadata": {
"id": "-xXOkn1xNvrz"
},
"execution_count": null,
"outputs": []
},
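{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Optional check (not part of the original pipeline):* once the upload finishes, the published dataset can be reloaded with the `datasets` library. This sketch assumes the Hub repo exposes the `train`/`validation` splits created above, with `stub` and `infobox` columns."
]
},
{
"cell_type": "code",
"source": [
"# Sketch: reload the published dataset from the Hub.\n",
"# Assumes the repo name, split names, and column names created above.\n",
"from datasets import load_dataset\n",
"\n",
"synthetic = load_dataset(\"andersoncliffb/women-religion-stubs-with-infoboxes\")\n",
"print(synthetic)  # expected splits: train / validation\n",
"print(synthetic[\"train\"][0][\"stub\"][:200])  # peek at one stub/infobox pair"
],
"metadata": {},
"execution_count": null,
"outputs": []
},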
{
"cell_type": "markdown",
"metadata": {
"id": "VW-CaheUj9ur"
},
"source": [
"## Step 9: Create Local Data Splits\n",
"\n",
"Creating local train/validation splits from our generated data for the fine-tuning process."
]
},
{
"cell_type": "code",
"source": [
"splits = stubs_and_infoboxes.splits(train_size=0.90, validation_size=0.10)"
],
"metadata": {
"id": "FZmC4DDfcJ11"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "sAlWKcPCj9ur"
},
"source": [
"## Step 10: Import Training Libraries\n",
"\n",
"Setting up for model fine-tuning:\n",
"- `TrainHFFineTune`: DataDreamer's wrapper for Hugging Face model training\n",
"- `LoraConfig`: Parameter-efficient fine-tuning using Low-Rank Adaptation"
]
},
{
"cell_type": "code",
"source": [
"from datadreamer.trainers import TrainHFFineTune\n",
"from peft import LoraConfig"
],
"metadata": {
"id": "MmmfDTbUg0qI"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "eiqO2UDoj9ur"
},
"source": [
"## Step 11: Configure the Training Setup\n",
"\n",
"Creating a trainer that will:\n",
"- Use Google's T5-v1.1-base as the foundation model\n",
"- Apply LoRA for efficient fine-tuning (only trains a small subset of parameters)\n",
"- Learn to transform Wikipedia stubs into appropriate infoboxes"
]
},
{
"cell_type": "code",
"source": [
"trainer = TrainHFFineTune(\n", | |
" \"Train an Wiki Article => Infoboxes Model\",\n", | |
" model_name=\"google/t5-v1_1-base\",\n", | |
" peft_config=LoraConfig(),\n", | |
")" | |
],
"metadata": {
"id": "XOyj4oT9WLDe"
},
"execution_count": null,
"outputs": []
},
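{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Illustrative note:* the trainer above relies on `peft`'s default `LoraConfig()`. The sketch below shows what an explicit configuration for a T5-style model might look like; the hyperparameter values are common starting points, not the settings used in this notebook."
]
},
{
"cell_type": "code",
"source": [
"# Illustrative only: an explicit LoRA configuration for a T5-style model.\n",
"# These values are typical starting points for experimentation, not what\n",
"# the trainer above uses (it keeps peft's LoraConfig() defaults).\n",
"example_lora = LoraConfig(\n",
"    r=16,                       # rank of the low-rank update matrices\n",
"    lora_alpha=32,              # scaling factor applied to the update\n",
"    lora_dropout=0.05,          # dropout on the LoRA layers\n",
"    target_modules=[\"q\", \"v\"],  # attention projections in T5 blocks\n",
")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},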
{
"cell_type": "markdown",
"metadata": {
"id": "CbnisLAMj9ur"
},
"source": [
"## Step 12: Train the Model\n",
"\n",
"Starting the fine-tuning process with:\n",
"- **Input**: Wikipedia stub articles\n",
"- **Output**: Generated infoboxes\n",
"- **30 epochs**: Multiple passes through the training data\n",
"- **Batch size 8**: Number of examples processed simultaneously\n",
"\n",
"This will take some time and use the L4 GPU for training."
]
},
{
"cell_type": "code",
"source": [
"trainer.train(\n",
"    train_input=splits[\"train\"].output[\"stub\"],\n",
"    train_output=splits[\"train\"].output[\"infobox\"],\n",
"    validation_input=splits[\"validation\"].output[\"stub\"],\n",
"    validation_output=splits[\"validation\"].output[\"infobox\"],\n",
"    epochs=30,\n",
"    batch_size=8,\n",
")"
],
"metadata": {
"id": "DjCegA3HWnBa"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "SaQNXVfij9us"
},
"source": [
"## Step 13: Publish the Fine-tuned Model\n",
"\n",
"Uploading the trained model to Hugging Face Hub so it can be:\n",
"- Downloaded and used by others\n",
"- Integrated into applications\n",
"- Further fine-tuned on different data"
]
},
{
"cell_type": "code",
"source": [
"trainer.publish_to_hf_hub(\"andersoncliffb/stubs-and-infoboxes\")\n"
],
"metadata": {
"id": "eQYEnqS6XgSH"
},
"execution_count": null,
"outputs": []
},
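{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Downstream usage sketch:* once published, the model could be loaded for inference roughly as below. This assumes the Hub repo holds a PEFT (LoRA) adapter on top of `google/t5-v1_1-base`; the exact loading code may differ depending on how DataDreamer exports the weights."
]
},
{
"cell_type": "code",
"source": [
"# Sketch of downstream inference with the published model.\n",
"# Assumes the repo contains a PEFT adapter over google/t5-v1_1-base;\n",
"# if merged weights are published instead, load the repo directly with\n",
"# AutoModelForSeq2SeqLM and skip the PeftModel step.\n",
"from transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n",
"from peft import PeftModel\n",
"\n",
"base = AutoModelForSeq2SeqLM.from_pretrained(\"google/t5-v1_1-base\")\n",
"model = PeftModel.from_pretrained(base, \"andersoncliffb/stubs-and-infoboxes\")\n",
"tokenizer = AutoTokenizer.from_pretrained(\"google/t5-v1_1-base\")\n",
"\n",
"stub_text = \"...\"  # a Wikipedia stub article to turn into an infobox\n",
"inputs = tokenizer(stub_text, return_tensors=\"pt\", truncation=True)\n",
"outputs = model.generate(**inputs, max_new_tokens=256)\n",
"print(tokenizer.decode(outputs[0], skip_special_tokens=True))"
],
"metadata": {},
"execution_count": null,
"outputs": []
},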
{
"cell_type": "markdown",
"metadata": {
"id": "T4OGzD0Dj9us"
},
"source": [
"## Step 14: Clean Up\n",
"\n",
"Properly closing the DataDreamer session and saving all pipeline metadata."
]
},
{
"cell_type": "code",
"source": [
"dd.stop()"
],
"metadata": {
"id": "FQHSsi3kN4ay"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Cj4qz-Otj9us"
},
"source": [
"## Summary\n",
"\n",
"This notebook demonstrates a complete pipeline for:\n",
"1. **Synthetic data generation**: Using GPT-4o-mini to create training examples\n",
"2. **Model fine-tuning**: Training T5 to learn the stub→infobox transformation\n",
"3. **Knowledge sharing**: Publishing both dataset and model to Hugging Face Hub\n",
"\n",
"The resulting model can generate Wikipedia infoboxes from article stubs, potentially helping editors improve Wikipedia coverage of underrepresented topics."
]
}
]
}