{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"private_outputs": true,
"provenance": [],
"machine_shape": "hm",
"gpuType": "L4",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/CliffordAnderson/ff326d8e9d657acff82dda380dd3bc3a/synthetic-data-generation-of-wikipedia-infoboxes.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8M4Lq_Eej9ul"
},
"source": [
"# Wikipedia Infobox Generation and Model Fine-tuning\n",
"\n",
"This notebook demonstrates how to:\n",
"1. Load Wikipedia stub articles about women in religion\n",
"2. Use GPT-4o-mini to generate appropriate infoboxes for these stubs\n",
"3. Fine-tune a T5 model to learn this stub → infobox transformation\n",
"4. Publish the resulting dataset and model to Hugging Face Hub\n",
"\n",
"The goal is to create a model that can automatically generate Wikipedia infoboxes from article content."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ElX2kqJ0j9un"
},
"source": [
"## Step 1: Install Required Libraries\n",
"\n",
"We need:\n",
"- `datadreamer.dev`: Framework for LLM-powered data generation and model training\n",
"- `datasets`: Hugging Face library for handling datasets\n",
"- `OpenAI`: For accessing GPT models to generate synthetic infoboxes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "aTg8VJNx80Jm"
},
"outputs": [],
"source": [
"!pip3 install datadreamer.dev datasets==3.2.0 OpenAI"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JXi89Ohqj9uo"
},
"source": [
"## Step 2: Import Core Libraries\n",
"\n",
"Setting up the main components we'll use for data processing and LLM interactions."
]
},
{
"cell_type": "code",
"source": [
"from datadreamer import DataDreamer\n",
"from datadreamer.llms import OpenAI\n",
"from datadreamer.steps import ProcessWithPrompt, HFHubDataSource"
],
"metadata": {
"id": "ubobFqSkBbvk"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "cdUF9THEj9up"
},
"source": [
"## Step 3: Configure API Keys\n",
"\n",
"Using Colab's secure userdata to access API keys for:\n",
"- **OpenAI**: To generate infoboxes using GPT-4o-mini\n",
"- **Hugging Face**: To download datasets and upload results\n",
"- **Weights & Biases**: For experiment tracking during training"
]
},
{
"cell_type": "code",
"source": [
"import os\n",
"from google.colab import userdata\n",
"\n",
"os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')\n",
"os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')\n",
"os.environ['WANDB_API_KEY'] = userdata.get('WANDB_API_KEY')"
],
"metadata": {
"id": "QDojl9hS_pPU"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "5E5mea5Aj9up"
},
"source": [
"## Step 4: Initialize DataDreamer Session\n",
"\n",
"DataDreamer manages the entire pipeline and saves intermediate results to `./output/` for reproducibility."
]
},
{
"cell_type": "code",
"source": [
"dd = DataDreamer('./output/')\n",
"dd.start()"
],
"metadata": {
"id": "6lpqylkw_ui1"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "fhc8mYa9j9uq"
},
"source": [
"## Step 5: Load Source Dataset\n",
"\n",
"Loading a curated dataset of Wikipedia stub articles about women in religion. These are short, incomplete articles that would benefit from having infoboxes added."
]
},
{
"cell_type": "code",
"source": [
"wiki_stubs_dataset = HFHubDataSource(\n",
" \"Get Women in Religion Stubs\",\n",
" \"andersoncliffb/women-in-religion-stubs\",\n",
" split=\"train\",\n",
").select_columns([\"Wiki_Content\"])"
],
"metadata": {
"id": "ZwT8N0ZCBt0b"
},
"execution_count": null,
"outputs": []
},
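{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before generating anything, it can help to peek at the raw stubs. The cell below is a quick sanity check that sits outside the DataDreamer pipeline: it loads the same Hugging Face dataset directly with `datasets.load_dataset` (assuming the `andersoncliffb/women-in-religion-stubs` repository is accessible with your `HF_TOKEN`) and prints the start of the first stub."
]
},
{
"cell_type": "code",
"source": [
"from datasets import load_dataset\n",
"\n",
"# Sanity check (outside the DataDreamer pipeline): load the source stubs directly\n",
"# and inspect one example before generating infoboxes.\n",
"stub_preview = load_dataset(\"andersoncliffb/women-in-religion-stubs\", split=\"train\")\n",
"print(f\"Number of stubs: {len(stub_preview)}\")\n",
"print(stub_preview[0][\"Wiki_Content\"][:500])  # first 500 characters of the first stub"
],
"metadata": {},
"execution_count": null,
"outputs": []
},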
{
"cell_type": "markdown",
"metadata": {
"id": "HBLR2-onj9uq"
},
"source": [
"## Step 6: Configure the LLM\n",
"\n",
"Setting up GPT-4o-mini as our generation model. This is cost-effective for generating structured content like infoboxes."
]
},
{
"cell_type": "code",
"source": [
"gpt4 = OpenAI(\n",
" model_name=\"gpt-4o-mini\",\n",
")"
],
"metadata": {
"id": "6dR3jalOByoN"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "3zxptzNwj9uq"
},
"source": [
"## Step 7: Generate Infoboxes from Stubs\n",
"\n",
"This is the core synthetic data generation step. For each Wikipedia stub:\n",
"1. Send it to GPT-4o-mini with instructions to create an appropriate infobox\n",
"2. Store both the original stub and generated infobox as training pairs\n",
"\n",
"This creates the input-output pairs we'll use to train our T5 model."
]
},
{
"cell_type": "code",
"source": [
"stubs_and_infoboxes = ProcessWithPrompt(\n",
" \"Generate Infoboxes from Stubs\",\n",
" inputs={\"inputs\": wiki_stubs_dataset.output[\"Wiki_Content\"]},\n",
" args={\n",
" \"llm\": gpt4,\n",
" \"instruction\": (\n",
" \"Extract the infobox from the Wikipedia stub. If there is no infobox, generate an appropriate Wikipedia infobox for the stub.\"\n",
" \"Return only the infoxbox, nothing else.\"\n",
" ),\n",
" },\n",
" outputs={\"inputs\": \"stub\", \"generations\": \"infobox\"},\n",
").select_columns([\"stub\", \"infobox\"])\n"
],
"metadata": {
"id": "yj0Vi4Y6B2y-"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "YGF5lo9Vj9ur"
},
"source": [
"## Step 8: Publish Generated Dataset\n",
"\n",
"Uploading our synthetic dataset to Hugging Face Hub with:\n",
"- 90% for training\n",
"- 10% for validation\n",
"\n",
"This makes the dataset publicly available and creates the train/validation splits we need."
]
},
{
"cell_type": "code",
"source": [
"stubs_and_infoboxes.publish_to_hf_hub(\n",
" \"andersoncliffb/women-religion-stubs-with-infoboxes\",\n",
" train_size=0.90,\n",
" validation_size=0.10,\n",
")"
],
"metadata": {
"id": "-xXOkn1xNvrz"
},
"execution_count": null,
"outputs": []
},
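{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional check, the published dataset can be loaded back from the Hub with `datasets.load_dataset`. This sketch assumes the publish call above created standard `train` and `validation` splits in `andersoncliffb/women-religion-stubs-with-infoboxes`; adjust the repository or split names if your run differs."
]
},
{
"cell_type": "code",
"source": [
"from datasets import load_dataset\n",
"\n",
"# Optional verification: reload the published synthetic dataset from the Hub.\n",
"# Assumes the publish step created \"train\" and \"validation\" splits.\n",
"published = load_dataset(\"andersoncliffb/women-religion-stubs-with-infoboxes\")\n",
"print(published)\n",
"print(published[\"train\"][0][\"stub\"][:300])\n",
"print(published[\"train\"][0][\"infobox\"][:300])"
],
"metadata": {},
"execution_count": null,
"outputs": []
},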
{
"cell_type": "markdown",
"metadata": {
"id": "VW-CaheUj9ur"
},
"source": [
"## Step 9: Create Local Data Splits\n",
"\n",
"Creating local train/validation splits from our generated data for the fine-tuning process."
]
},
{
"cell_type": "code",
"source": [
"splits = stubs_and_infoboxes.splits(train_size=0.90, validation_size=0.10)"
],
"metadata": {
"id": "FZmC4DDfcJ11"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "sAlWKcPCj9ur"
},
"source": [
"## Step 10: Import Training Libraries\n",
"\n",
"Setting up for model fine-tuning:\n",
"- `TrainHFFineTune`: DataDreamer's wrapper for Hugging Face model training\n",
"- `LoraConfig`: Parameter-efficient fine-tuning using Low-Rank Adaptation"
]
},
{
"cell_type": "code",
"source": [
"from datadreamer.trainers import TrainHFFineTune\n",
"from peft import LoraConfig"
],
"metadata": {
"id": "MmmfDTbUg0qI"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "eiqO2UDoj9ur"
},
"source": [
"## Step 11: Configure the Training Setup\n",
"\n",
"Creating a trainer that will:\n",
"- Use Google's T5-v1.1-base as the foundation model\n",
"- Apply LoRA for efficient fine-tuning (only trains a small subset of parameters)\n",
"- Learn to transform Wikipedia stubs into appropriate infoboxes"
]
},
{
"cell_type": "code",
"source": [
"trainer = TrainHFFineTune(\n",
" \"Train an Wiki Article => Infoboxes Model\",\n",
" model_name=\"google/t5-v1_1-base\",\n",
" peft_config=LoraConfig(),\n",
")"
],
"metadata": {
"id": "XOyj4oT9WLDe"
},
"execution_count": null,
"outputs": []
},
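{
"cell_type": "markdown",
"metadata": {},
"source": [
"The trainer above uses `LoraConfig()` with PEFT's defaults. If you want explicit control over the adapter rank and which T5 projection matrices are adapted, you can pass a customized configuration instead. The hyperparameter values in the sketch below are illustrative, not the settings used for the published model."
]
},
{
"cell_type": "code",
"source": [
"from peft import LoraConfig\n",
"\n",
"# Illustrative alternative: an explicit LoRA configuration for T5-style models.\n",
"# The rank (r), scaling (lora_alpha), dropout, and target modules are example\n",
"# values, not the ones used in this notebook.\n",
"custom_lora = LoraConfig(\n",
"    r=16,\n",
"    lora_alpha=32,\n",
"    lora_dropout=0.05,\n",
"    target_modules=[\"q\", \"v\"],  # T5 self-attention query/value projections\n",
"    task_type=\"SEQ_2_SEQ_LM\",\n",
")\n",
"\n",
"# To use it, pass peft_config=custom_lora when constructing TrainHFFineTune."
],
"metadata": {},
"execution_count": null,
"outputs": []
},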
{
"cell_type": "markdown",
"metadata": {
"id": "CbnisLAMj9ur"
},
"source": [
"## Step 12: Train the Model\n",
"\n",
"Starting the fine-tuning process with:\n",
"- **Input**: Wikipedia stub articles\n",
"- **Output**: Generated infoboxes\n",
"- **30 epochs**: Multiple passes through the training data\n",
"- **Batch size 8**: Number of examples processed simultaneously\n",
"\n",
"This will take some time and use the L4 GPU for training."
]
},
{
"cell_type": "code",
"source": [
"trainer.train(\n",
" train_input=splits[\"train\"].output[\"stub\"],\n",
" train_output=splits[\"train\"].output[\"infobox\"],\n",
" validation_input=splits[\"validation\"].output[\"stub\"],\n",
" validation_output=splits[\"validation\"].output[\"infobox\"],\n",
" epochs=30,\n",
" batch_size=8,\n",
" )"
],
"metadata": {
"id": "DjCegA3HWnBa"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "SaQNXVfij9us"
},
"source": [
"## Step 13: Publish the Fine-tuned Model\n",
"\n",
"Uploading the trained model to Hugging Face Hub so it can be:\n",
"- Downloaded and used by others\n",
"- Integrated into applications\n",
"- Further fine-tuned on different data"
]
},
{
"cell_type": "code",
"source": [
"trainer.publish_to_hf_hub(\"andersoncliffb/stubs-and-infoboxes\")\n"
],
"metadata": {
"id": "eQYEnqS6XgSH"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "T4OGzD0Dj9us"
},
"source": [
"## Step 14: Clean Up\n",
"\n",
"Properly closing the DataDreamer session and saving all pipeline metadata."
]
},
{
"cell_type": "code",
"source": [
"dd.stop()"
],
"metadata": {
"id": "FQHSsi3kN4ay"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Cj4qz-Otj9us"
},
"source": [
"## Summary\n",
"\n",
"This notebook demonstrates a complete pipeline for:\n",
"1. **Synthetic data generation**: Using GPT-4o-mini to create training examples\n",
"2. **Model fine-tuning**: Training T5 to learn the stub→infobox transformation\n",
"3. **Knowledge sharing**: Publishing both dataset and model to Hugging Face Hub\n",
"\n",
"The resulting model can generate Wikipedia infoboxes from article stubs, potentially helping editors improve Wikipedia coverage of underrepresented topics."
]
}
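,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Appendix: Trying the Fine-tuned Model\n",
"\n",
"A minimal sketch of how the published model could be used for inference with the `transformers` library. It assumes the `andersoncliffb/stubs-and-infoboxes` repository holds a standalone seq2seq checkpoint (i.e., that the LoRA weights were merged when publishing); if only adapter weights were pushed, load `google/t5-v1_1-base` and attach the adapter with `peft` instead. The example stub text below is a placeholder."
]
},
{
"cell_type": "code",
"source": [
"from transformers import pipeline\n",
"\n",
"# Sketch: load the published model for stub -> infobox generation.\n",
"# Assumes the Hub repo contains a full seq2seq checkpoint (merged LoRA weights).\n",
"infobox_generator = pipeline(\n",
"    \"text2text-generation\",\n",
"    model=\"andersoncliffb/stubs-and-infoboxes\",\n",
")\n",
"\n",
"# Placeholder stub text for illustration only.\n",
"example_stub = \"Jane Doe (born 1950) is an American theologian and minister.\"\n",
"print(infobox_generator(example_stub, max_new_tokens=256)[0][\"generated_text\"])"
],
"metadata": {},
"execution_count": null,
"outputs": []
}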
]
}