Created
September 23, 2024 12:04
Gist: timruffles/d5a0a77699fef48077892fdd2c499fd0
notebook
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "# Identifying Text Classification Features for Prompts vs Source Code\n\nThe goal of this notebook is to analyze and identify effective text classification features (regex patterns) that can distinguish between user prompts (ideally code-focused) and source code. We will:\n\n- Download corpora of user prompts and source code.\n- Define a list of regex patterns to serve as text features.\n- Apply these regexes to both datasets.\n- Generate statistics on their effectiveness.\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## 1. Setup and Data Acquisition\n\nWe will use the following datasets:\n\n- **User Prompts:** We'll use the [StackOverflow Questions](https://archive.org/details/stackexchange) dataset as a proxy for code-focused user prompts.\n- **Source Code:** We'll use the [CodeSearchNet](https://github.com/github/CodeSearchNet) dataset, which contains code from multiple programming languages.\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### 1.1. Import Required Libraries"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "import requests\nimport zipfile\nimport io\nimport os\nimport re\nimport json\nimport pandas as pd\nimport matplotlib.pyplot as plt\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### 1.2. Download and Extract the StackOverflow Questions Dataset\n\nWe'll download a sample of the StackOverflow dataset for demonstration purposes."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Download StackOverflow dataset\n# The full dump is published as a .7z archive, which Python's standard library cannot read:\n# stackoverflow_url = 'https://archive.org/download/stackexchange/stackoverflow.com-Posts.7z'\n\n# For this example we use a small pre-processed plain-text file (one question per line) instead.\n# Alternatively, you can use the StackLite dataset.\nquestions_url = 'https://raw.githubusercontent.com/jacoxu/StackOverflow/master/data/train.txt'\n\nresponse = requests.get(questions_url)\nresponse.raise_for_status()\n\n# Save the dataset\nwith open('stackoverflow_questions.txt', 'wb') as f:\n    f.write(response.content)\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### 1.3. Download and Extract the CodeSearchNet Dataset\n\nWe'll use the Python subset of the CodeSearchNet dataset."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Download CodeSearchNet dataset for Python code (note: this archive is large, ~1 GB)\ncodesearchnet_url = 'https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip'\n\nresponse = requests.get(codesearchnet_url)\nresponse.raise_for_status()\n\nwith zipfile.ZipFile(io.BytesIO(response.content)) as z:\n    z.extractall('codesearchnet')\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## 2. Data Preparation\n\nWe'll load the datasets and prepare them for regex matching."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### 2.1. Load StackOverflow Questions"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Read the questions\nwith open('stackoverflow_questions.txt', 'r', encoding='utf-8') as f:\n    questions = f.readlines()\n\n# For demonstration, we'll take the first 10,000 questions\nquestions = questions[:10000]\n\nprint(f\"Loaded {len(questions)} questions.\")\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### 2.2. Load CodeSearchNet Source Code"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "import json\nimport gzip\n\n# Read Python code files (handle both plain and gzipped JSON Lines files)\ncode_texts = []\ncode_dir = 'codesearchnet/resources/python/final/jsonl/train/'\n\nfor filename in os.listdir(code_dir):\n    if filename.endswith('.jsonl') or filename.endswith('.jsonl.gz'):\n        filepath = os.path.join(code_dir, filename)\n        opener = gzip.open if filename.endswith('.gz') else open\n        with opener(filepath, 'rt', encoding='utf-8') as f:\n            for line in f:\n                data = json.loads(line)  # each line is one JSON record; json.loads is safer than eval\n                code = data.get('code')\n                if code:\n                    code_texts.append(code)\n\n# For demonstration, we'll take the first 10,000 code snippets\ncode_texts = code_texts[:10000]\n\nprint(f\"Loaded {len(code_texts)} code snippets.\")\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## 3. Define Text Feature Regexes\n\nWe'll define a list of regex patterns that might help differentiate between user prompts and source code."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Define regex patterns\nregex_patterns = {\n    'Contains import statement': r'\\bimport\\b',\n    'Contains def statement': r'\\bdef\\b',\n    'Contains class declaration': r'\\bclass\\b',\n    'Contains print statement': r'\\bprint\\s*\\(',\n    'Contains if statement': r'\\bif\\b',\n    'Contains for loop': r'\\bfor\\b',\n    'Contains while loop': r'\\bwhile\\b',\n    'Contains function call': r'\\w+\\s*\\(',\n    'Contains code comment': r'#.*',\n    'Contains HTML tag': r'<[^>]+>',\n    'Contains URL': r'https?://\\S+',\n    'Contains code block (Python indentation)': r'\\n\\s{4,}\\S',\n}\n"
},
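{
"cell_type": "markdown",
"metadata": {},
"source": "As a quick illustration (the two sample strings below are invented for this notebook, not drawn from the datasets), we can spot-check a pattern by hand before running the full analysis:"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Spot-check one pattern against hand-written samples\npattern = regex_patterns['Contains def statement']\nprint(bool(re.search(pattern, 'def foo(): pass')))              # True\nprint(bool(re.search(pattern, 'How do I define a function?')))  # False: \\b blocks a match inside 'define'\n"
},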
{
"cell_type": "markdown",
"metadata": {},
"source": "## 4. Apply Regex Patterns and Generate Statistics\n\nWe'll iterate over each text in both datasets, apply the regex patterns, and collect statistics."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Function to analyze texts\n\ndef analyze_texts(texts, regex_patterns):\n    results = {pattern_name: {'match_count': 0, 'total_chars_matched': 0} for pattern_name in regex_patterns}\n    total_texts = len(texts)\n    total_chars = sum(len(text) for text in texts)\n    \n    for text in texts:\n        for pattern_name, pattern in regex_patterns.items():\n            # Evaluate every pattern on every text; stopping at the first match\n            # would undercount patterns listed later in the dict\n            matches = list(re.finditer(pattern, text))\n            if matches:\n                results[pattern_name]['match_count'] += 1\n                results[pattern_name]['total_chars_matched'] += sum(len(m.group(0)) for m in matches)\n    \n    # Calculate ratios\n    for pattern_name in regex_patterns:\n        results[pattern_name]['match_ratio'] = results[pattern_name]['match_count'] / total_texts\n        results[pattern_name]['char_ratio'] = results[pattern_name]['total_chars_matched'] / total_chars\n    \n    return results\n"
},
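{
"cell_type": "markdown",
"metadata": {},
"source": "Before running on the full datasets, a tiny sanity check (two invented samples, illustrative only) confirms the function returns per-pattern counts and ratios:"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Sanity check on two hand-written samples\nsample_code = 'import os\\ndef main():\\n    print(os.getcwd())\\n'\nsample_prompt = 'How do I sort a list of dictionaries by value?'\n\nsanity = analyze_texts([sample_code, sample_prompt], regex_patterns)\nprint(sanity['Contains import statement'])\nprint(sanity['Contains def statement'])\n"
},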
{
"cell_type": "markdown",
"metadata": {},
"source": "### 4.1. Analyze User Prompts"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "prompt_results = analyze_texts(questions, regex_patterns)\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### 4.2. Analyze Source Code"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "code_results = analyze_texts(code_texts, regex_patterns)\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## 5. Results and Statistics\n\nWe'll present the statistics in a table for easy comparison."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Create DataFrame for prompts\nprompt_df = pd.DataFrame.from_dict(prompt_results, orient='index')\nprompt_df['Type'] = 'Prompt'\n\n# Create DataFrame for code\ncode_df = pd.DataFrame.from_dict(code_results, orient='index')\ncode_df['Type'] = 'Code'\n\n# Combine DataFrames\nresults_df = pd.concat([prompt_df, code_df])\n\n# Reset index\nresults_df = results_df.reset_index().rename(columns={'index': 'Pattern'})\n\n# Display the results\nresults_df[['Pattern', 'Type', 'match_count', 'match_ratio', 'total_chars_matched', 'char_ratio']]\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### 5.1. Interpretation\n\nFrom the results, we can observe which regex patterns are more effective at matching user prompts versus source code. Patterns with a higher match ratio in source code compared to prompts may be good candidates for features that identify source code."
},
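{
"cell_type": "markdown",
"metadata": {},
"source": "One simple way to rank the patterns (an illustrative heuristic added here, not a standard metric) is the difference between the code and prompt match ratios; larger positive values suggest stronger code indicators:"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Rank patterns by (code match ratio) - (prompt match ratio)\nratio_diff = {\n    name: code_results[name]['match_ratio'] - prompt_results[name]['match_ratio']\n    for name in regex_patterns\n}\nfor name, diff in sorted(ratio_diff.items(), key=lambda kv: kv[1], reverse=True):\n    print(f'{name}: {diff:+.3f}')\n"
},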
{
"cell_type": "markdown",
"metadata": {},
"source": "## 6. Visualization\n\nLet's visualize the match ratios for prompts and code."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Pivot the DataFrame for plotting\npivot_df = results_df.pivot(index='Pattern', columns='Type', values='match_ratio')\n\n# Plot\npivot_df.plot(kind='bar', figsize=(12,6))\nplt.ylabel('Match Ratio')\nplt.title('Regex Pattern Match Ratios for Prompts vs Source Code')\nplt.xticks(rotation=45, ha='right')  # long pattern names overlap without rotation\nplt.tight_layout()\nplt.show()\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## 7. Conclusion\n\nBy analyzing the match ratios and character ratios of different regex patterns on user prompts and source code, we can identify features that are effective in distinguishing between the two. For example, patterns like 'Contains def statement' or 'Contains import statement' may have a higher match ratio in source code, making them good features for classification."
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": ""
}
},
"nbformat": 4,
"nbformat_minor": 2
}