synthetic-data-generation-of-wikipedia-infoboxes.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"private_outputs": true,
"provenance": [],
"machine_shape": "hm",
"gpuType": "L4",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/CliffordAnderson/ff326d8e9d657acff82dda380dd3bc3a/synthetic-data-generation-of-wikipedia-infoboxes.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8M4Lq_Eej9ul"
},
"source": [
"# Wikipedia Infobox Generation and Model Fine-tuning\n",
"\n",
"This notebook demonstrates how to:\n",
"1. Load Wikipedia stub articles about women in religion\n",
"2. Use GPT-4o-mini to generate appropriate infoboxes for these stubs\n",
"3. Fine-tune a T5 model to learn this stub → infobox transformation\n",
"4. Publish the resulting dataset and model to Hugging Face Hub\n",
"\n",
"The goal is to create a model that can automatically generate Wikipedia infoboxes from article content."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ElX2kqJ0j9un"
},
"source": [
"## Step 1: Install Required Libraries\n",
"\n",
"We need:\n",
"- `datadreamer.dev`: Framework for LLM-powered data generation and model training\n",
"- `datasets`: Hugging Face library for handling datasets\n",
"- `OpenAI`: For accessing GPT models to generate synthetic infoboxes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "aTg8VJNx80Jm"
},
"outputs": [],
"source": [
"!pip3 install datadreamer.dev datasets==3.2.0 OpenAI"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JXi89Ohqj9uo"
},
"source": [
"## Step 2: Import Core Libraries\n",
"\n",
"Setting up the main components we'll use for data processing and LLM interactions."
]
},
{
"cell_type": "code",
"source": [
"from datadreamer import DataDreamer\n",
"from datadreamer.llms import OpenAI\n",
"from datadreamer.steps import ProcessWithPrompt, HFHubDataSource"
],
"metadata": {
"id": "ubobFqSkBbvk"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "cdUF9THEj9up"
},
"source": [
"## Step 3: Configure API Keys\n",
"\n",
"Using Colab's secure userdata to access API keys for:\n",
"- **OpenAI**: To generate infoboxes using GPT-4o-mini\n",
"- **Hugging Face**: To download datasets and upload results\n",
"- **Weights & Biases**: For experiment tracking during training"
]
},
{
"cell_type": "code",
"source": [
"import os\n",
"from google.colab import userdata\n",
"\n",
"os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')\n",
"os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')\n",
"os.environ['WANDB_API_KEY'] = userdata.get('WANDB_API_KEY')"
],
"metadata": {
"id": "QDojl9hS_pPU"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "5E5mea5Aj9up"
},
"source": [
"## Step 4: Initialize DataDreamer Session\n",
"\n",
"DataDreamer manages the entire pipeline and saves intermediate results to `./output/` for reproducibility."
]
},
{
"cell_type": "code",
"source": [
"dd = DataDreamer('./output/')\n",
"dd.start()"
],
"metadata": {
"id": "6lpqylkw_ui1"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "fhc8mYa9j9uq"
},
"source": [
"## Step 5: Load Source Dataset\n",
"\n",
"Loading a curated dataset of Wikipedia stub articles about women in religion. These are short, incomplete articles that would benefit from having infoboxes added."
]
},
{
"cell_type": "code",
"source": [
"wiki_stubs_dataset = HFHubDataSource(\n",
"    \"Get Women in Religion Stubs\",\n",
"    \"andersoncliffb/women-in-religion-stubs\",\n",
"    split=\"train\",\n",
").select_columns([\"Wiki_Content\"])"
],
"metadata": {
"id": "ZwT8N0ZCBt0b"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "HBLR2-onj9uq"
},
"source": [
"## Step 6: Configure the LLM\n",
"\n",
"Setting up GPT-4o-mini as our generation model. This is cost-effective for generating structured content like infoboxes."
]
},
{
"cell_type": "code",
"source": [
"gpt4 = OpenAI(\n",
"    model_name=\"gpt-4o-mini\",\n",
")"
],
"metadata": {
"id": "6dR3jalOByoN"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "3zxptzNwj9uq"
},
"source": [
"## Step 7: Generate Infoboxes from Stubs\n",
"\n",
"This is the core synthetic data generation step. For each Wikipedia stub:\n",
"1. Send it to GPT-4o-mini with instructions to create an appropriate infobox\n", | |
"2. Store both the original stub and generated infobox as training pairs\n", | |
"\n", | |
"This creates the input-output pairs we'll use to train our T5 model." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"stubs_and_infoboxes = ProcessWithPrompt(\n", | |
" \"Generate Infoboxes from Stubs\",\n", | |
" inputs={\"inputs\": wiki_stubs_dataset.output[\"Wiki_Content\"]},\n", | |
" args={\n", | |
" \"llm\": gpt4,\n", | |
" \"instruction\": (\n", | |
" \"Extract the infobox from the Wikipedia stub. If there is no infobox, generate an appropriate Wikipedia infobox for the stub.\"\n", | |
" \"Return only the infoxbox, nothing else.\"\n", | |
" ),\n", | |
" },\n", | |
" outputs={\"inputs\": \"stub\", \"generations\": \"infobox\"},\n", | |
").select_columns([\"stub\", \"infobox\"])\n" | |
],
"metadata": {
"id": "yj0Vi4Y6B2y-"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "YGF5lo9Vj9ur"
},
"source": [
"## Step 8: Publish Generated Dataset\n",
"\n",
"Uploading our synthetic dataset to Hugging Face Hub with:\n",
"- 90% for training\n",
"- 10% for validation\n",
"\n",
"This makes the dataset publicly available and creates the train/validation splits we need."
]
},
{
"cell_type": "code",
"source": [
"stubs_and_infoboxes.publish_to_hf_hub(\n",
"    \"andersoncliffb/women-religion-stubs-with-infoboxes\",\n",
"    train_size=0.90,\n",
"    validation_size=0.10,\n",
")"
],
"metadata": {
"id": "-xXOkn1xNvrz"
},
"execution_count": null,
"outputs": []
},
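{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Optional check (not part of the original pipeline):* once the upload finishes, the published dataset can be reloaded with the `datasets` library. This sketch assumes the Hub repo exposes the `train`/`validation` splits created above, with `stub` and `infobox` columns."
]
},
{
"cell_type": "code",
"source": [
"# Sketch: reload the published dataset from the Hub.\n",
"# Assumes the repo name, split names, and column names created above.\n",
"from datasets import load_dataset\n",
"\n",
"synthetic = load_dataset(\"andersoncliffb/women-religion-stubs-with-infoboxes\")\n",
"print(synthetic)  # expected splits: train / validation\n",
"print(synthetic[\"train\"][0][\"stub\"][:200])  # peek at one stub/infobox pair"
],
"metadata": {},
"execution_count": null,
"outputs": []
},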
{
"cell_type": "markdown",
"metadata": {
"id": "VW-CaheUj9ur"
},
"source": [
"## Step 9: Create Local Data Splits\n",
"\n",
"Creating local train/validation splits from our generated data for the fine-tuning process."
]
},
{
"cell_type": "code",
"source": [
"splits = stubs_and_infoboxes.splits(train_size=0.90, validation_size=0.10)"
],
"metadata": {
"id": "FZmC4DDfcJ11"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "sAlWKcPCj9ur"
},
"source": [
"## Step 10: Import Training Libraries\n",
"\n",
"Setting up for model fine-tuning:\n",
"- `TrainHFFineTune`: DataDreamer's wrapper for Hugging Face model training\n",
"- `LoraConfig`: Parameter-efficient fine-tuning using Low-Rank Adaptation"
]
},
{
"cell_type": "code",
"source": [
"from datadreamer.trainers import TrainHFFineTune\n",
"from peft import LoraConfig"
],
"metadata": {
"id": "MmmfDTbUg0qI"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "eiqO2UDoj9ur"
},
"source": [
"## Step 11: Configure the Training Setup\n",
"\n",
"Creating a trainer that will:\n",
"- Use Google's T5-v1.1-base as the foundation model\n",
"- Apply LoRA for efficient fine-tuning (only trains a small subset of parameters)\n",
"- Learn to transform Wikipedia stubs into appropriate infoboxes"
]
},
{
"cell_type": "code",
"source": [
"trainer = TrainHFFineTune(\n", | |
" \"Train an Wiki Article => Infoboxes Model\",\n", | |
" model_name=\"google/t5-v1_1-base\",\n", | |
" peft_config=LoraConfig(),\n", | |
")" | |
],
"metadata": {
"id": "XOyj4oT9WLDe"
},
"execution_count": null,
"outputs": []
},
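{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Illustrative note:* the trainer above relies on `peft`'s default `LoraConfig()`. The sketch below shows what an explicit configuration for a T5-style model might look like; the hyperparameter values are common starting points, not the settings used in this notebook."
]
},
{
"cell_type": "code",
"source": [
"# Illustrative only: an explicit LoRA configuration for a T5-style model.\n",
"# These values are typical starting points for experimentation, not what\n",
"# the trainer above uses (it keeps peft's LoraConfig() defaults).\n",
"example_lora = LoraConfig(\n",
"    r=16,                       # rank of the low-rank update matrices\n",
"    lora_alpha=32,              # scaling factor applied to the update\n",
"    lora_dropout=0.05,          # dropout on the LoRA layers\n",
"    target_modules=[\"q\", \"v\"],  # attention projections in T5 blocks\n",
")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},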
{
"cell_type": "markdown",
"metadata": {
"id": "CbnisLAMj9ur"
},
"source": [
"## Step 12: Train the Model\n",
"\n",
"Starting the fine-tuning process with:\n",
"- **Input**: Wikipedia stub articles\n",
"- **Output**: Generated infoboxes\n",
"- **30 epochs**: Multiple passes through the training data\n",
"- **Batch size 8**: Number of examples processed simultaneously\n",
"\n",
"This will take some time and use the L4 GPU for training."
]
},
{
"cell_type": "code",
"source": [
"trainer.train(\n",
"    train_input=splits[\"train\"].output[\"stub\"],\n",
"    train_output=splits[\"train\"].output[\"infobox\"],\n",
"    validation_input=splits[\"validation\"].output[\"stub\"],\n",
"    validation_output=splits[\"validation\"].output[\"infobox\"],\n",
"    epochs=30,\n",
"    batch_size=8,\n",
")"
],
"metadata": {
"id": "DjCegA3HWnBa"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "SaQNXVfij9us"
},
"source": [
"## Step 13: Publish the Fine-tuned Model\n",
"\n",
"Uploading the trained model to Hugging Face Hub so it can be:\n",
"- Downloaded and used by others\n",
"- Integrated into applications\n",
"- Further fine-tuned on different data"
]
},
{
"cell_type": "code",
"source": [
"trainer.publish_to_hf_hub(\"andersoncliffb/stubs-and-infoboxes\")\n"
],
"metadata": {
"id": "eQYEnqS6XgSH"
},
"execution_count": null,
"outputs": []
},
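{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Downstream usage sketch:* once published, the model could be loaded for inference roughly as below. This assumes the Hub repo holds a PEFT (LoRA) adapter on top of `google/t5-v1_1-base`; the exact loading code may differ depending on how DataDreamer exports the weights."
]
},
{
"cell_type": "code",
"source": [
"# Sketch of downstream inference with the published model.\n",
"# Assumes the repo contains a PEFT adapter over google/t5-v1_1-base;\n",
"# if merged weights are published instead, load the repo directly with\n",
"# AutoModelForSeq2SeqLM and skip the PeftModel step.\n",
"from transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n",
"from peft import PeftModel\n",
"\n",
"base = AutoModelForSeq2SeqLM.from_pretrained(\"google/t5-v1_1-base\")\n",
"model = PeftModel.from_pretrained(base, \"andersoncliffb/stubs-and-infoboxes\")\n",
"tokenizer = AutoTokenizer.from_pretrained(\"google/t5-v1_1-base\")\n",
"\n",
"stub_text = \"...\"  # a Wikipedia stub article to turn into an infobox\n",
"inputs = tokenizer(stub_text, return_tensors=\"pt\", truncation=True)\n",
"outputs = model.generate(**inputs, max_new_tokens=256)\n",
"print(tokenizer.decode(outputs[0], skip_special_tokens=True))"
],
"metadata": {},
"execution_count": null,
"outputs": []
},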
{
"cell_type": "markdown",
"metadata": {
"id": "T4OGzD0Dj9us"
},
"source": [
"## Step 14: Clean Up\n",
"\n",
"Properly closing the DataDreamer session and saving all pipeline metadata."
]
},
{
"cell_type": "code",
"source": [
"dd.stop()"
],
"metadata": {
"id": "FQHSsi3kN4ay"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Cj4qz-Otj9us"
},
"source": [
"## Summary\n",
"\n",
"This notebook demonstrates a complete pipeline for:\n",
"1. **Synthetic data generation**: Using GPT-4o-mini to create training examples\n",
"2. **Model fine-tuning**: Training T5 to learn the stub→infobox transformation\n",
"3. **Knowledge sharing**: Publishing both dataset and model to Hugging Face Hub\n",
"\n",
"The resulting model can generate Wikipedia infoboxes from article stubs, potentially helping editors improve Wikipedia coverage of underrepresented topics."
]
}
]
}