Last active
July 2, 2023 08:57
Cora explorations
{ | |
"cells": [ | |
{
"cell_type": "markdown",
"id": "dbeead4f",
"metadata": {},
"source": [
"# Cora\n",
"\n",
"A collection of Cora explorations. The focus here is on machine learning rather than visualization. Jupyter notebooks are not well suited to graph visualization; see, however, the [yFiles Jupyter plugin](https://www.yworks.com/products/yfiles-graphs-for-jupyter).\n",
"\n", | |
"*Author*: Francois Vanderseypen, Orbifold Consulting (https://orbifold.net).<br>\n", | |
"*Article*: https://graphsandnetworks.com/the-cora-dataset<br>\n", | |
"*Last update*: July 2023.<br>" | |
] | |
}, | |
{
"cell_type": "markdown",
"id": "f59e42e0",
"metadata": {},
"source": [
"## Download data\n",
"\n",
"This part is common to all packages. It downloads and unpacks the necessary data:"
] | |
}, | |
{
"cell_type": "code",
"execution_count": 58,
"id": "a06b6d68",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import tarfile\n",
"\n",
"import pandas as pd\n",
"import requests\n",
"\n",
"data_dir = os.path.expanduser(\"~/cora\")\n",
"os.makedirs(data_dir, exist_ok=True)\n",
"\n",
"# Download the archive and fail early on an HTTP error.\n",
"cora_tgz = os.path.join(data_dir, \"cora.tgz\")\n",
"response = requests.get(\"https://temprl.com/cora.tgz\")\n",
"response.raise_for_status()\n",
"with open(cora_tgz, 'wb') as output:\n",
"    output.write(response.content)\n",
"\n",
"# Unpack every file directly into data_dir, dropping the archive's folder structure.\n",
"with tarfile.open(cora_tgz) as z:\n",
"    for member in z:\n",
"        if member.isdir():\n",
"            continue\n",
"        fname = os.path.basename(member.name)\n",
"        z.makefile(member, os.path.join(data_dir, fname))"
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "aa3525da", | |
"metadata": {}, | |
"source": [ | |
"## NetworkX" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "9bd8820a", | |
"metadata": {}, | |
"source": [ | |
"NetworkX is the most common graph package in Python. It does not perform any machine learning but it has a very complete graph analysis API and performs well on small and medium datasets." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 59, | |
"id": "60559b49", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import networkx as nx\n", | |
"\n", | |
"edge_data = pd.read_csv(os.path.join(data_dir, \"cora.cites\"), sep='\\t', header=None, names=[\"target\", \"source\"])\n", | |
"edge_data[\"label\"] = \"cites\"" | |
] | |
}, | |
{
"cell_type": "markdown",
"id": "70c0a3a1",
"metadata": {},
"source": [
"Each row of the edge list is just a source-target pair; the edges carry no payload:"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 60, | |
"id": "ba16ed6c", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>target</th>\n", | |
" <th>source</th>\n", | |
" <th>label</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>741</th>\n", | |
" <td>3191</td>\n", | |
" <td>1127530</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4347</th>\n", | |
" <td>162080</td>\n", | |
" <td>1109830</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3549</th>\n", | |
" <td>69198</td>\n", | |
" <td>231198</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>209</th>\n", | |
" <td>114</td>\n", | |
" <td>91975</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4766</th>\n", | |
" <td>289085</td>\n", | |
" <td>689152</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" target source label\n", | |
"741 3191 1127530 cites\n", | |
"4347 162080 1109830 cites\n", | |
"3549 69198 231198 cites\n", | |
"209 114 91975 cites\n", | |
"4766 289085 689152 cites" | |
] | |
}, | |
"execution_count": 60, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"edge_data.sample(frac=1).head(5)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 61, | |
"id": "658349e2", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"Gnx = nx.from_pandas_edgelist(edge_data, edge_attr=\"label\")\n", | |
"nx.set_node_attributes(Gnx, \"paper\", \"label\")" | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 62,
"id": "f3e560f3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'label': 'paper'}"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Gnx.nodes[1103985]"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 63, | |
"id": "a9cafdda", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"feature_names = [\"w_{}\".format(ii) for ii in range(1433)]\n", | |
"column_names = feature_names + [\"subject\"]\n", | |
"node_data = pd.read_csv(os.path.join(data_dir, \"cora.content\"), sep='\\t', header=None, names=column_names)" | |
] | |
}, | |
{
"cell_type": "markdown",
"id": "42c07961",
"metadata": {},
"source": [
"The payload of each node consists of the word weights (one column per vocabulary word) together with the subject label:"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 64, | |
"id": "c27ee8a9", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>w_0</th>\n", | |
" <th>w_1</th>\n", | |
" <th>w_2</th>\n", | |
" <th>w_3</th>\n", | |
" <th>w_4</th>\n", | |
" <th>w_5</th>\n", | |
" <th>w_6</th>\n", | |
" <th>w_7</th>\n", | |
" <th>w_8</th>\n", | |
" <th>w_9</th>\n", | |
" <th>...</th>\n", | |
" <th>w_1424</th>\n", | |
" <th>w_1425</th>\n", | |
" <th>w_1426</th>\n", | |
" <th>w_1427</th>\n", | |
" <th>w_1428</th>\n", | |
" <th>w_1429</th>\n", | |
" <th>w_1430</th>\n", | |
" <th>w_1431</th>\n", | |
" <th>w_1432</th>\n", | |
" <th>subject</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>31336</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Neural_Networks</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1061127</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Rule_Learning</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1106406</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Reinforcement_Learning</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>13195</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Reinforcement_Learning</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>37879</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Probabilistic_Methods</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>5 rows × 1434 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" w_0 w_1 w_2 w_3 w_4 w_5 w_6 w_7 w_8 w_9 ... w_1424 \\\n", | |
"31336 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"1061127 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"1106406 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"13195 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"37879 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"\n", | |
" w_1425 w_1426 w_1427 w_1428 w_1429 w_1430 w_1431 w_1432 \\\n", | |
"31336 0 1 0 0 0 0 0 0 \n", | |
"1061127 1 0 0 0 0 0 0 0 \n", | |
"1106406 0 0 0 0 0 0 0 0 \n", | |
"13195 0 0 0 0 0 0 0 0 \n", | |
"37879 0 0 0 0 0 0 0 0 \n", | |
"\n", | |
" subject \n", | |
"31336 Neural_Networks \n", | |
"1061127 Rule_Learning \n", | |
"1106406 Reinforcement_Learning \n", | |
"13195 Reinforcement_Learning \n", | |
"37879 Probabilistic_Methods \n", | |
"\n", | |
"[5 rows x 1434 columns]" | |
] | |
}, | |
"execution_count": 64, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"node_data.head(5)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "61af1435", | |
"metadata": {}, | |
"source": [ | |
"There are seven subjects:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 65, | |
"id": "5f405208", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'Case_Based',\n", | |
" 'Genetic_Algorithms',\n", | |
" 'Neural_Networks',\n", | |
" 'Probabilistic_Methods',\n", | |
" 'Reinforcement_Learning',\n", | |
" 'Rule_Learning',\n", | |
" 'Theory'}" | |
] | |
}, | |
"execution_count": 65, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"set(node_data[\"subject\"])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "098bc658", | |
"metadata": {}, | |
"source": [ | |
"If you don't like the weights in multiple columns you can merge them:" | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 66,
"id": "5ff5a8b3",
"metadata": {},
"outputs": [],
"source": [
"weight_column_names = node_data.columns[0:-1]\n",
"node_data['content'] = node_data[weight_column_names].apply(\n",
"    lambda x: ','.join(x.dropna().astype(str)),\n",
"    axis=1\n",
")\n",
"node_data.drop(weight_column_names, axis=1, inplace=True)"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 84, | |
"id": "870e3fae", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>subject</th>\n", | |
" <th>content</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>31336</th>\n", | |
" <td>Neural_Networks</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1061127</th>\n", | |
" <td>Rule_Learning</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1106406</th>\n", | |
" <td>Reinforcement_Learning</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>13195</th>\n", | |
" <td>Reinforcement_Learning</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>37879</th>\n", | |
" <td>Probabilistic_Methods</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" subject \\\n", | |
"31336 Neural_Networks \n", | |
"1061127 Rule_Learning \n", | |
"1106406 Reinforcement_Learning \n", | |
"13195 Reinforcement_Learning \n", | |
"37879 Probabilistic_Methods \n", | |
"\n", | |
" content \n", | |
"31336 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... \n", | |
"1061127 0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,... \n", | |
"1106406 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... \n", | |
"13195 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... \n", | |
"37879 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... " | |
] | |
}, | |
"execution_count": 84, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"node_data.head()" | |
] | |
}, | |
{
"cell_type": "markdown",
"id": "a1e65089",
"metadata": {},
"source": [
"Note that the content is not an embedding; it is the encoded article content, with one entry per vocabulary word."
] | |
}, | |
{
"cell_type": "code",
"execution_count": 87,
"id": "706dde3c",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np  # needed here when the notebook runs top to bottom\n",
"# Turn the comma-separated string back into a numeric vector.\n",
"node_data['content'] = node_data['content'].apply(lambda x: np.array([int(i) for i in x.split(',')]))\n",
"# node_data.astype({'subject': 'str','content':'str'})"
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "30af6bfb", | |
"metadata": {}, | |
"source": [ | |
"### Poor man's path to link prediction: Jaccard" | |
] | |
}, | |
{
"cell_type": "markdown",
"id": "ae086f04",
"metadata": {},
"source": [
"Long before graph machine learning came along, people were predicting edges with very simple algorithms. The Jaccard index measures how much the immediate neighborhoods of two nodes overlap: the number of shared neighbors divided by the size of the union of both neighborhoods. The more the neighborhoods overlap, the more likely the two nodes are to be connected. The idea stems from social network analysis: the more friends you share with somebody, the more likely it is that you know each other."
]
},
{
"cell_type": "markdown",
"id": "fecc7d71",
"metadata": {},
"source": [
"The following is a manual calculation for a few existing edges:"
] | |
}, | |
{
"cell_type": "code",
"execution_count": null,
"id": "6c6e78ac",
"metadata": {},
"outputs": [],
"source": [
"for u, v in list(Gnx.edges)[12:20]:\n",
"    cnbors = list(nx.common_neighbors(Gnx, u, v))\n",
"    union_size = len(set(Gnx[u]) | set(Gnx[v]))\n",
"    print(u, v, len(cnbors)/union_size)"
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "19be9453", | |
"metadata": {}, | |
"source": [ | |
"Using NetworkX you can do the whole graph in one go:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "60c3f13d", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"jaccard_predictions = list(nx.jaccard_coefficient(Gnx))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "e1a7c6e6", | |
"metadata": {}, | |
"source": [ | |
"Filtering out the most likely candidates:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "0453540a", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"jaccard_predictions_top = [(t[0],t[1]) for t in jaccard_predictions if t[2]>0.8]" | |
] | |
}, | |
{
"cell_type": "markdown",
"id": "ce922f59",
"metadata": {},
"source": [
"Note that none of these are existing edges in the graph, because `nx.jaccard_coefficient` only scores non-adjacent node pairs unless it is given an explicit list of pairs:"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "de9a9463", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"[t for t in jaccard_predictions_top if Gnx.has_edge(*t)]" | |
] | |
}, | |
{
"cell_type": "markdown",
"id": "cb53b0aa",
"metadata": {},
"source": [
"Plenty of node pairs share their entire neighborhood, which gives a Jaccard coefficient equal to one. The only pair above the threshold with a partially overlapping neighborhood is the following:"
] | |
}, | |
{
"cell_type": "code",
"execution_count": null,
"id": "e2337f85",
"metadata": {},
"outputs": [],
"source": [
"[t for t in jaccard_predictions if t[2]>0.8 and t[2]!=1]"
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "463be756", | |
"metadata": {}, | |
"source": [ | |
"You can see that they differ in a single node:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "b460e857", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"print(\"Common: \",sorted(nx.common_neighbors(Gnx, 14428, 14430)))\n", | |
"print(\"14428:\", list(nx.neighbors(Gnx,14428)))\n", | |
"print(\"14430:\", list(nx.neighbors(Gnx,14430)))" | |
] | |
}, | |
{
"cell_type": "markdown",
"id": "03d71b71",
"metadata": {},
"source": [
"The main problem with Jaccard is that it ignores the payload and looks only at the immediate topology. Even topologically it only considers the first hop, whereas node neighborhoods further out may also have a lot in common.\n",
"This makes Jaccard indicative rather than reliable."
]
},
{
"cell_type": "markdown",
"id": "cc174736",
"metadata": {},
"source": [
"### Cosine similarity of the payload\n",
"\n",
"We can also look at the payload alone and check whether the existing links are correlated with it.\n",
"The content can be seen as vectors, and if we take the cosine similarity we get:"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 108, | |
"id": "f5598ee9", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from random import sample\n", | |
"import numpy as np\n", | |
"sample_size = 1000\n", | |
"connected_sample=sample(list(Gnx.edges), sample_size)\n", | |
"disconnected_sample=sample(list(nx.complement(Gnx).edges), sample_size)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 100, | |
"id": "66726843", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"assert len([e for e in disconnected_sample if Gnx.has_edge(*e)])==0" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "30637e4e", | |
"metadata": {}, | |
"source": [ | |
"The cosine similarity is simply the dot product of the normalized vectors:" | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 101,
"id": "67b684a0",
"metadata": {},
"outputs": [],
"source": [
"def cosine(u, v):\n",
"    uc = node_data.loc[u][\"content\"]\n",
"    vc = node_data.loc[v][\"content\"]\n",
"    return np.dot(uc, vc)/(np.linalg.norm(uc)*np.linalg.norm(vc))"
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "3095e532", | |
"metadata": {}, | |
"source": [ | |
"So, for the connected subset we get:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 109, | |
"id": "42b5311b", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"connected_cosine = pd.DataFrame({\"source\":[e[0] for e in connected_sample],\"target\":[e[1] for e in connected_sample]})\n", | |
"connected_cosine[\"cosine\"]= connected_cosine.apply(lambda row: cosine(row.source,row.target), axis=1)\n", | |
"# connected_cosine" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 110, | |
"id": "e2a558c7", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<Axes: >" | |
] | |
}, | |
"execution_count": 110, | |
"metadata": {}, | |
"output_type": "execute_result" | |
}, | |
{ | |
"data": { | |
"image/png": "", | |
"text/plain": [ | |
"<Figure size 640x480 with 1 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"connected_cosine.cosine.plot()" | |
] | |
}, | |
{
"cell_type": "markdown",
"id": "0baec06b",
"metadata": {},
"source": [
"The average cosine similarity of the connected sample is:"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 111, | |
"id": "618ebbec", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.16600344225511338" | |
] | |
}, | |
"execution_count": 111, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"np.average(connected_cosine.cosine)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 112, | |
"id": "a7bdd2ac", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"disconnected_cosine = pd.DataFrame({\"source\":[e[0] for e in disconnected_sample],\"target\":[e[1] for e in disconnected_sample]})\n", | |
"disconnected_cosine[\"cosine\"]= disconnected_cosine.apply(lambda row: cosine(row.source,row.target), axis=1)" | |
] | |
}, | |
{
"cell_type": "markdown",
"id": "fc647f1f",
"metadata": {},
"source": [
"Visual inspection reveals that the disconnected nodes have, on average, a lower cosine similarity:"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 113, | |
"id": "572b9faf", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<Axes: >" | |
] | |
}, | |
"execution_count": 113, | |
"metadata": {}, | |
"output_type": "execute_result" | |
}, | |
{ | |
"data": { | |
"image/png": "", | |
"text/plain": [ | |
"<Figure size 640x480 with 1 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"disconnected_cosine.cosine.plot()" | |
] | |
}, | |
{
"cell_type": "markdown",
"id": "16d4f01a",
"metadata": {},
"source": [
"On average the cosine similarity is nearly three times higher for connected nodes:"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 115, | |
"id": "10361735", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"2.8722134876583296" | |
] | |
}, | |
"execution_count": 115, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"np.average(connected_cosine.cosine)/np.average(disconnected_cosine.cosine)" | |
] | |
}, | |
{
"cell_type": "markdown",
"id": "70d7e475",
"metadata": {},
"source": [
"This shows that connectivity is correlated with the payload of the nodes, so the payload can be used to predict links."
]
},
{
"cell_type": "markdown",
"id": "c3ce4f38",
"metadata": {},
"source": [
"Let's see how predictive this is. Looking at the plots, a cosine similarity of 0.1 seems to be a reasonable threshold: pairs above it are predicted to be connected.\n"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 145, | |
"id": "57cf7449", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"cosine_threshold = 0.1\n", | |
"connected_cosine[\"cosine_prediction\"]= connected_cosine.apply(lambda row: row.cosine>cosine_threshold, axis=1)\n", | |
"disconnected_cosine[\"cosine_prediction\"]= disconnected_cosine.apply(lambda row: row.cosine>cosine_threshold, axis=1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "a3df371d", | |
"metadata": {}, | |
"source": [ | |
"The confusion matrix for this prediction can be assembled as follows:" | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 146,
"id": "fc21cead",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import seaborn as sn\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"# Rows: actual class (0 = disconnected, 1 = connected); columns: predicted class.\n",
"confusion_01 = disconnected_cosine[\"cosine_prediction\"].sum()\n",
"confusion_00 = sample_size - confusion_01\n",
"confusion_11 = connected_cosine[\"cosine_prediction\"].sum()\n",
"confusion_10 = sample_size - confusion_11\n",
"# Express the counts as percentages of each sample.\n",
"array = np.array([[confusion_00, confusion_01],\n",
"                  [confusion_10, confusion_11]\n",
"                 ])*100/sample_size\n",
"\n", | |
"df_cm = pd.DataFrame(array, range(2), range(2))\n", | |
"# plt.figure(figsize=(10,7))\n", | |
"sn.set(font_scale=1.4) # for label size\n", | |
"sn.heatmap(df_cm, annot=True, annot_kws={\"size\": 16}) # font size\n", | |
"\n", | |
"plt.show()" | |
] | |
}, | |
{
"cell_type": "markdown",
"id": "d7c2c045",
"metadata": {},
"source": [
"This shows that predicting the edges purely on the basis of the node content is not too bad."
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "507a5531", | |
"metadata": {}, | |
"source": [ | |
"### Combined Jaccard and cosine similarity\n", | |
"\n", | |
"It's natural to wonder whether the topological similarity improves the cosine prediction.\n", | |
"There are various ways the probabilities could be combined. Let's use the optimistic approach and take the maximum:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 164, | |
"id": "0e0d5c92", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"combined_threshold = 0.1\n", | |
"\n", | |
"connected_cosine[\"jaccard\"]= connected_cosine.apply(lambda row: list(nx.jaccard_coefficient(Gnx, [(row.source, row.target)]))[0][2], axis=1)\n", | |
"connected_cosine[\"combined_prediction\"]= connected_cosine.apply(lambda row: max(row.cosine,row.jaccard)>combined_threshold, axis=1)\n", | |
"\n", | |
"disconnected_cosine[\"jaccard\"]= disconnected_cosine.apply(lambda row: list(nx.jaccard_coefficient(Gnx, [(row.source, row.target)]))[0][2], axis=1)\n", | |
"disconnected_cosine[\"combined_prediction\"]= disconnected_cosine.apply(lambda row: max(row.cosine,row.jaccard)>combined_threshold, axis=1)" | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 165,
"id": "9710d284",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"confusion_01 = disconnected_cosine[\"combined_prediction\"].sum()\n",
"confusion_00 = sample_size - confusion_01\n",
"confusion_11 = connected_cosine[\"combined_prediction\"].sum()\n",
"confusion_10 = sample_size - confusion_11\n",
"array = np.array([[confusion_00, confusion_01],\n",
"                  [confusion_10, confusion_11]\n",
"                 ])*100/sample_size\n",
"\n", | |
"df_cm = pd.DataFrame(array, range(2), range(2))\n", | |
"# plt.figure(figsize=(10,7))\n", | |
"sn.set(font_scale=1.4) # for label size\n", | |
"sn.heatmap(df_cm, annot=True, annot_kws={\"size\": 16}) # font size\n", | |
"\n", | |
"plt.show()" | |
] | |
}, | |
{
"cell_type": "markdown",
"id": "57644303",
"metadata": {},
"source": [
"This gives **an accuracy of 0.76** (= (78+74)/200, the matrix entries being percentages). Altogether not a bad score for this simplistic approach."
] | |
}, | |
{
"cell_type": "markdown",
"id": "558d0ab2",
"metadata": {},
"source": [
"## Node2Vec\n",
"\n",
"The node2vec embedding learns low-dimensional representations of the nodes in a graph from random walks through the graph, each starting at a target node. Every walk effectively assembles a 'sentence' of node ids, which allows one to reuse the Word2Vec mechanics."
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "c810d04f", | |
"metadata": {}, | |
"source": [ | |
"Let's create a basic node2vec embedding:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 168, | |
"id": "2ac8451e", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"application/vnd.jupyter.widget-view+json": { | |
"model_id": "393df34b5b9141dda2adea339944e580", | |
"version_major": 2, | |
"version_minor": 0 | |
}, | |
"text/plain": [ | |
"Computing transition probabilities: 0%| | 0/2708 [00:00<?, ?it/s]" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"Generating walks (CPU: 1): 100%|██████████| 50/50 [00:07<00:00, 6.78it/s]\n", | |
"Generating walks (CPU: 2): 100%|██████████| 50/50 [00:07<00:00, 6.78it/s]\n", | |
"Generating walks (CPU: 3): 100%|██████████| 50/50 [00:07<00:00, 6.72it/s]\n", | |
"Generating walks (CPU: 4): 100%|██████████| 50/50 [00:07<00:00, 6.75it/s]\n" | |
] | |
} | |
], | |
"source": [ | |
"from node2vec import Node2Vec\n", | |
"node2vec = Node2Vec(Gnx, dimensions=64, walk_length=30, num_walks=200, workers=4) \n", | |
"\n", | |
"# Embed nodes\n", | |
"model = node2vec.fit(window=10, min_count=1, batch_words=4) \n" | |
] | |
}, | |
{
"cell_type": "markdown",
"id": "656d9f40",
"metadata": {},
"source": [
"The following function computes the cosine similarity of two nodes based on their embedding vectors:"
] | |
}, | |
{
"cell_type": "code",
"execution_count": 186,
"id": "2bf529ae",
"metadata": {},
"outputs": [],
"source": [
"def vector_cosine(u, v):\n",
"    uc = model.wv[str(u)]\n",
"    vc = model.wv[str(v)]\n",
"    return np.dot(uc, vc)/(np.linalg.norm(uc)*np.linalg.norm(vc))"
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "2f464e61", | |
"metadata": {}, | |
"source": [ | |
"Just like in the previous approaches we'll add it to the dataframe and compute the confusion matrix:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 199, | |
"id": "de4ea4a9", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"embedding_threshold = 0.8\n", | |
"\n", | |
"connected_cosine[\"embedding\"]= connected_cosine.apply(lambda row: vector_cosine(row.source, row.target), axis=1)\n", | |
"connected_cosine[\"embedding_prediction\"]= connected_cosine.apply(lambda row: row.embedding>embedding_threshold, axis=1)\n", | |
"\n", | |
"disconnected_cosine[\"embedding\"]= disconnected_cosine.apply(lambda row: vector_cosine(row.source, row.target), axis=1)\n", | |
"disconnected_cosine[\"embedding_prediction\"]= disconnected_cosine.apply(lambda row: row.embedding>embedding_threshold, axis=1)" | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 200,
"id": "e6610194",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"confusion_01 = disconnected_cosine[\"embedding_prediction\"].sum()\n",
"confusion_00 = sample_size - confusion_01\n",
"confusion_11 = connected_cosine[\"embedding_prediction\"].sum()\n",
"confusion_10 = sample_size - confusion_11\n",
"array = np.array([[confusion_00, confusion_01],\n",
"                  [confusion_10, confusion_11]\n",
"                 ])\n",
"\n", | |
"df_cm = pd.DataFrame(array, range(2), range(2))\n", | |
"# plt.figure(figsize=(10,7))\n", | |
"sn.set(font_scale=1.4) # for label size\n", | |
"sn.heatmap(df_cm, annot=True, annot_kws={\"size\": 16}) # font size\n", | |
"\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "9cf5aac0", | |
"metadata": {}, | |
"source": [ | |
"This gives **an accuracy of 0.92** (=(1000+840)/2000)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "2ac24094", | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3 (ipykernel)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.10.11" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} | |
{ | |
"cell_type": "markdown"{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"id": "dbeead4f", | |
"metadata": {}, | |
"source": [ | |
"# Cora\n", | |
"\n", | |
"A collection of Cora explorations. The focus is on machine learning and not visualization here. Jupyter notebooks are not adequate for visualization, see however the [yFiles Jupyter plugin](https://www.yworks.com/products/yfiles-graphs-for-jupyter).\n", | |
"\n", | |
"*Author*: Francois Vanderseypen, Orbifold Consulting (https://orbifold.net).<br>\n", | |
"*Article*: https://graphsandnetworks.com/the-cora-dataset<br>\n", | |
"*Last update*: July 2023.<br>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "f59e42e0", | |
"metadata": {}, | |
"source": [ | |
"## Download data\n", | |
"\n", | |
"This part is common to all packages, it downloads and unpacks the necessary data:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 58, | |
"id": "a06b6d68", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import os\n", | |
"import pandas as pd\n", | |
"data_dir = os.path.expanduser(\"~/cora\")\n", | |
"if not os.path.exists(data_dir):\n", | |
" os.makedirs(data_dir)\n", | |
"import requests\n", | |
"\n", | |
" \n", | |
"cora_tgz = os.path.join(data_dir, \"cora.tgz\")\n", | |
"response = requests.get(\"https://temprl.com/cora.tgz\", stream = True)\n", | |
"with open(cora_tgz,'wb') as output:\n", | |
" output.write(response.content)\n", | |
"\n", | |
"import tarfile\n", | |
"with tarfile.open(cora_tgz) as z:\n", | |
" for member in z:\n", | |
" if member.isdir():\n", | |
" continue\n", | |
" fname = member.name.rsplit('/',1)[1]\n", | |
" z.makefile(member,data_dir + '/' + fname)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "aa3525da", | |
"metadata": {}, | |
"source": [ | |
"## NetworkX" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "9bd8820a", | |
"metadata": {}, | |
"source": [ | |
"NetworkX is the most common graph package in Python. It does not perform any machine learning but it has a very complete graph analysis API and performs well on small and medium datasets." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 59, | |
"id": "60559b49", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import networkx as nx\n", | |
"\n", | |
"edge_data = pd.read_csv(os.path.join(data_dir, \"cora.cites\"), sep='\\t', header=None, names=[\"target\", \"source\"])\n", | |
"edge_data[\"label\"] = \"cites\"" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "70c0a3a1", | |
"metadata": {}, | |
"source": [ | |
"The edge list is just a source-target couple and there is no payload:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 60, | |
"id": "ba16ed6c", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>target</th>\n", | |
" <th>source</th>\n", | |
" <th>label</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>741</th>\n", | |
" <td>3191</td>\n", | |
" <td>1127530</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4347</th>\n", | |
" <td>162080</td>\n", | |
" <td>1109830</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3549</th>\n", | |
" <td>69198</td>\n", | |
" <td>231198</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>209</th>\n", | |
" <td>114</td>\n", | |
" <td>91975</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4766</th>\n", | |
" <td>289085</td>\n", | |
" <td>689152</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" target source label\n", | |
"741 3191 1127530 cites\n", | |
"4347 162080 1109830 cites\n", | |
"3549 69198 231198 cites\n", | |
"209 114 91975 cites\n", | |
"4766 289085 689152 cites" | |
] | |
}, | |
"execution_count": 60, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"edge_data.sample(frac=1).head(5)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 61, | |
"id": "658349e2", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"Gnx = nx.from_pandas_edgelist(edge_data, edge_attr=\"label\")\n", | |
"nx.set_node_attributes(Gnx, \"paper\", \"label\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 62, | |
"id": "f3e560f3", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'label': 'paper'}" | |
] | |
}, | |
"execution_count": 62, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
" Gnx.nodes[1103985]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 63, | |
"id": "a9cafdda", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"feature_names = [\"w_{}\".format(ii) for ii in range(1433)]\n", | |
"column_names = feature_names + [\"subject\"]\n", | |
"node_data = pd.read_csv(os.path.join(data_dir, \"cora.content\"), sep='\\t', header=None, names=column_names)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "42c07961", | |
"metadata": {}, | |
"source": [ | |
"The payload on the node consists of the weights with the subject label:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 64, | |
"id": "c27ee8a9", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>w_0</th>\n", | |
" <th>w_1</th>\n", | |
" <th>w_2</th>\n", | |
" <th>w_3</th>\n", | |
" <th>w_4</th>\n", | |
" <th>w_5</th>\n", | |
" <th>w_6</th>\n", | |
" <th>w_7</th>\n", | |
" <th>w_8</th>\n", | |
" <th>w_9</th>\n", | |
" <th>...</th>\n", | |
" <th>w_1424</th>\n", | |
" <th>w_1425</th>\n", | |
" <th>w_1426</th>\n", | |
" <th>w_1427</th>\n", | |
" <th>w_1428</th>\n", | |
" <th>w_1429</th>\n", | |
" <th>w_1430</th>\n", | |
" <th>w_1431</th>\n", | |
" <th>w_1432</th>\n", | |
" <th>subject</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>31336</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Neural_Networks</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1061127</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Rule_Learning</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1106406</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Reinforcement_Learning</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>13195</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Reinforcement_Learning</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>37879</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Probabilistic_Methods</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>5 rows × 1434 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" w_0 w_1 w_2 w_3 w_4 w_5 w_6 w_7 w_8 w_9 ... w_1424 \\\n", | |
"31336 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"1061127 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"1106406 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"13195 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"37879 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"\n", | |
" w_1425 w_1426 w_1427 w_1428 w_1429 w_1430 w_1431 w_1432 \\\n", | |
"31336 0 1 0 0 0 0 0 0 \n", | |
"1061127 1 0 0 0 0 0 0 0 \n", | |
"1106406 0 0 0 0 0 0 0 0 \n", | |
"13195 0 0 0 0 0 0 0 0 \n", | |
"37879 0 0 0 0 0 0 0 0 \n", | |
"\n", | |
" subject \n", | |
"31336 Neural_Networks \n", | |
"1061127 Rule_Learning \n", | |
"1106406 Reinforcement_Learning \n", | |
"13195 Reinforcement_Learning \n", | |
"37879 Probabilistic_Methods \n", | |
"\n", | |
"[5 rows x 1434 columns]" | |
] | |
}, | |
"execution_count": 64, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"node_data.head(5)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "61af1435", | |
"metadata": {}, | |
"source": [ | |
"There are seven subjects:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 65, | |
"id": "5f405208", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'Case_Based',\n", | |
" 'Genetic_Algorithms',\n", | |
" 'Neural_Networks',\n", | |
" 'Probabilistic_Methods',\n", | |
" 'Reinforcement_Learning',\n", | |
" 'Rule_Learning',\n", | |
" 'Theory'}" | |
] | |
}, | |
"execution_count": 65, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"set(node_data[\"subject\"])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "098bc658", | |
"metadata": {}, | |
"source": [ | |
"If you don't like the weights in multiple columns you can merge them:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 66, | |
"id": "5ff5a8b3", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"weight_column_names = node_data.columns[0:-1]\n", | |
"node_data['content'] = node_data[weight_column_names].apply(\n", | |
" lambda x: ','.join(x.dropna().astype(str)),\n", | |
" axis=1\n", | |
")\n", | |
"node_data.drop(weight_column_names, axis=1, inplace=True)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 84, | |
"id": "870e3fae", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>subject</th>\n", | |
" <th>content</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>31336</th>\n", | |
" <td>Neural_Networks</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1061127</th>\n", | |
" <td>Rule_Learning</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1106406</th>\n", | |
" <td>Reinforcement_Learning</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>13195</th>\n", | |
" <td>Reinforcement_Learning</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>37879</th>\n", | |
" <td>Probabilistic_Methods</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" subject \\\n", | |
"31336 Neural_Networks \n", | |
"1061127 Rule_Learning \n", | |
"1106406 Reinforcement_Learning \n", | |
"13195 Reinforcement_Learning \n", | |
"37879 Probabilistic_Methods \n", | |
"\n", | |
" content \n", | |
"31336 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... \n", | |
"1061127 0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,... \n", | |
"1106406 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... \n", | |
"13195 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... \n", | |
"37879 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... " | |
] | |
}, | |
"execution_count": 84, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"node_data.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "a1e65089", | |
"metadata": {}, | |
"source": [ | |
"Note that the content is not an embedding but is the encoded article content." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 87, | |
"id": "3a6fa0f7", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"node_data['content'] = node_data['content'].apply(lambda x: np.array([int(i) for i in x.split(',')]))\n", | |
"# node_data.astype({'subject': 'str','content':'str'})" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "30af6bfb", | |
"metadata": {}, | |
"source": [ | |
"### Poor man's path to link prediction: Jaccard" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "ae086f04", | |
"metadata": {}, | |
"source": [ | |
"Long before graph machine learning came along, people were predicting edges using very simple algorithms. The Jaccard index (algorithm) basically looks at how the immediate neighborhood of two nodes overlap and the more they overlap the more they are likely to be connected. The idea stems from social network analysis where the more friends you share with somebody, the more likely you know each other." | |
] | |
}, | |
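The overlap ratio described above can be sketched without NetworkX, using plain Python sets. This is only an illustration: the toy adjacency dictionary and the helper name `jaccard_index` are assumptions made for the example and are not part of the notebook or of the Cora data.

```python
def jaccard_index(adjacency, u, v):
    """Return |N(u) & N(v)| / |N(u) | N(v)| for two nodes of an undirected graph."""
    neighbors_u = set(adjacency[u])
    neighbors_v = set(adjacency[v])
    union = neighbors_u | neighbors_v
    if not union:
        # Two isolated nodes share nothing; score them as 0.
        return 0.0
    return len(neighbors_u & neighbors_v) / len(union)


# Hypothetical toy graph: "a" and "b" share the neighbors "c" and "d".
toy_graph = {
    "a": ["c", "d", "e"],
    "b": ["c", "d"],
    "c": ["a", "b"],
    "d": ["a", "b"],
    "e": ["a"],
}

score_ab = jaccard_index(toy_graph, "a", "b")  # 2 common neighbors out of 3 in the union
score_cd = jaccard_index(toy_graph, "c", "d")  # identical neighborhoods
print(score_ab, score_cd)
```

Nodes "c" and "d" have identical neighborhoods, which is the situation that produces a score of exactly one.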
{ | |
"cell_type": "markdown", | |
"id": "fecc7d71", | |
"metadata": {}, | |
"source": [ | |
"The following is a manual calculation for some:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "6c6e78ac", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"for u,v in list(Gnx.edges)[12:20]:\n", | |
" cnbors = list(nx.common_neighbors(Gnx, u, v))\n", | |
" union_size = len(set(Gnx[u]) | set(Gnx[v])) \n", | |
" print(u,v, len(cnbors)/union_size)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "19be9453", | |
"metadata": {}, | |
"source": [ | |
"Using NetworkX you can do the whole graph in one go:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "60c3f13d", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"jaccard_predictions = list(nx.jaccard_coefficient(Gnx))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "e1a7c6e6", | |
"metadata": {}, | |
"source": [ | |
"Filtering out the most likely candidates:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "0453540a", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"jaccard_predictions_top = [(t[0],t[1]) for t in jaccard_predictions if t[2]>0.8]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "ce922f59", | |
"metadata": {}, | |
"source": [ | |
"Note that none of these are existing edges in the graph:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "de9a9463", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"[t for t in jaccard_predictions_top if Gnx.has_edge(*t)]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "e30e0277", | |
"metadata": {}, | |
"source": [ | |
"There are plenty of nodes which have a fully common neighborhood leading to a probability equal to one. The only case with a partially overlapping neighborhood is the following:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "e2337f85", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
" [t for t in jaccard_predictions if t[2]>0.8 and t[2]!=1]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "9b6c660d", | |
"metadata": {}, | |
"source": [ | |
"You can see that they differ in a single node:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "fd797519", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"print(\"Common: \",sorted(nx.common_neighbors(Gnx, 14428, 14430)))\n", | |
"print(\"14428:\", list(nx.neighbors(Gnx,14428)))\n", | |
"print(\"14430:\", list(nx.neighbors(Gnx,14430)))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "b912f361", | |
"metadata": {}, | |
"source": [ | |
"The main problem with Jaccard is the fact that it does not take the payload into account, only the immediate topology is looked at. Even the topology, it's only the first hop and maybe node neighborhoods on a higher level have a lot in common.\n", | |
"This makes Jaccard indicative rather than reliable." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "d121db6f", | |
"metadata": {}, | |
"source": [ | |
"### Cosine similarity of the payload\n", | |
"\n", | |
"We can look at the payload only and see whether the existing links are correlated with the payload.\n", | |
"The content can be seen a vectors and if we take the cosine similarity we get:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 108, | |
"id": "ffff5321", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from random import sample\n", | |
"import numpy as np\n", | |
"sample_size = 1000\n", | |
"connected_sample=sample(list(Gnx.edges), sample_size)\n", | |
"disconnected_sample=sample(list(nx.complement(Gnx).edges), sample_size)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 100, | |
"id": "5606b585", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"assert len([e for e in disconnected_sample if Gnx.has_edge(*e)])==0" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "b1563272", | |
"metadata": {}, | |
"source": [ | |
"The cosine similarity is simply the dot product of the normalized vectors:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 101, | |
"id": "d21e9952", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def cosine(u,v):\n", | |
" uc = node_data.loc[u][\"content\"]\n", | |
" vc = node_data.loc[v][\"content\"]\n", | |
" return np.dot(uc, vc)/(np.linalg.norm(uc)*np.linalg.norm(vc))" | |
] | |
}, | |
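The function above depends on NumPy and on the notebook's `node_data` frame. As a dependency-free sketch of the same formula, the snippet below computes the cosine similarity of short binary vectors. The helper name `cosine_similarity`, the three example vectors and the choice to return 0 for an all-zero vector are assumptions made for illustration; they are not taken from the Cora data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen here: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical word-indicator vectors, much shorter than Cora's 1433 entries.
paper_1 = [1, 0, 1, 1, 0]
paper_2 = [1, 0, 0, 1, 0]
paper_3 = [0, 1, 0, 0, 1]

similar = cosine_similarity(paper_1, paper_2)    # two shared words
unrelated = cosine_similarity(paper_1, paper_3)  # no shared words
print(similar, unrelated)
```

For binary vectors the dot product simply counts the shared words, so papers without any word in common always score zero.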
{ | |
"cell_type": "markdown", | |
"id": "24bfdee6", | |
"metadata": {}, | |
"source": [ | |
"So, for the connected subset we get:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 109, | |
"id": "2fa25891", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"connected_cosine = pd.DataFrame({\"source\":[e[0] for e in connected_sample],\"target\":[e[1] for e in connected_sample]})\n", | |
"connected_cosine[\"cosine\"]= connected_cosine.apply(lambda row: cosine(row.source,row.target), axis=1)\n", | |
"# connected_cosine" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 110, | |
"id": "b8564d60", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<Axes: >" | |
] | |
}, | |
"execution_count": 110, | |
"metadata": {}, | |
"output_type": "execute_result" | |
}, | |
{ | |
"data": { | |
"image/png": "", | |
"text/plain": [ | |
"<Figure size 640x480 with 1 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"connected_cosine.cosine.plot()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "f4c1c7d1", | |
"metadata": {}, | |
"source": [ | |
"With an average cosine:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 111, | |
"id": "b9d1ff3e", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.16600344225511338" | |
] | |
}, | |
"execution_count": 111, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"np.average(connected_cosine.cosine)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 112, | |
"id": "a9da3ce4", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"disconnected_cosine = pd.DataFrame({\"source\":[e[0] for e in disconnected_sample],\"target\":[e[1] for e in disconnected_sample]})\n", | |
"disconnected_cosine[\"cosine\"]= disconnected_cosine.apply(lambda row: cosine(row.source,row.target), axis=1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "f55ea8d3", | |
"metadata": {}, | |
"source": [ | |
"Visual inspection reveals that the disconnected nodes have on average a lower consine:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 113, | |
"id": "32edfcd0", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<Axes: >" | |
] | |
}, | |
"execution_count": 113, | |
"metadata": {}, | |
"output_type": "execute_result" | |
}, | |
{ | |
"data": { | |
"image/png": "", | |
"text/plain": [ | |
"<Figure size 640x480 with 1 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"disconnected_cosine.cosine.plot()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "e80b3341", | |
"metadata": {}, | |
"source": [ | |
"On average the cosine similarity is three times higher for connected nodes:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 115, | |
"id": "549153eb", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"2.8722134876583296" | |
] | |
}, | |
"execution_count": 115, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"np.average(connected_cosine.cosine)/np.average(disconnected_cosine.cosine)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "8fd35200", | |
"metadata": {}, | |
"source": [ | |
"This shows that connectivity is correlated with the payload of the nodes and can be used to predict links." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "62a78970", | |
"metadata": {}, | |
"source": [ | |
"Let's see how predictive this is. Looking at the plot you can see that cosine similarity above 0.1 seems to be a threshold:\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 145, | |
"id": "b53f0f27", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"cosine_threshold = 0.1\n", | |
"connected_cosine[\"cosine_prediction\"]= connected_cosine.apply(lambda row: row.cosine>cosine_threshold, axis=1)\n", | |
"disconnected_cosine[\"cosine_prediction\"]= disconnected_cosine.apply(lambda row: row.cosine>cosine_threshold, axis=1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "47834bb8", | |
"metadata": {}, | |
"source": [ | |
"The confusion matrix for this prediction can be assembled as follows:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 146, | |
"id": "ad98c2db", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "", | |
"text/plain": [ | |
"<Figure size 640x480 with 2 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"import seaborn as sn\n", | |
"import pandas as pd\n", | |
"import matplotlib.pyplot as plt\n", | |
"confusion_01 = disconnected_cosine[\"cosine_prediction\"].sum()\n", | |
"confusion_00 = sample_size - confusion_01\n", | |
"confusion_11 = connected_cosine[\"cosine_prediction\"].sum()\n", | |
"confusion_10 = sample_size - confusion_11\n", | |
"array = np.array([[confusion_00, confusion_01],\n", | |
" [confusion_10, confusion_11]\n", | |
" ])*100/sample_size\n", | |
"\n", | |
"df_cm = pd.DataFrame(array, range(2), range(2))\n", | |
"# plt.figure(figsize=(10,7))\n", | |
"sn.set(font_scale=1.4) # for label size\n", | |
"sn.heatmap(df_cm, annot=True, annot_kws={\"size\": 16}) # font size\n", | |
"\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "aab1575e", | |
"metadata": {}, | |
"source": [ | |
"This shows that prediction the edges purely on the basis of the node content is not too bad. " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "53a11790", | |
"metadata": {}, | |
"source": [ | |
"### Combined Jaccard and cosine similarity\n", | |
"\n", | |
"It's natural to wonder whether the topological similarity improves the cosine prediction.\n", | |
"There are various ways the probabilities could be combined. Let's use the optimistic approach and take the maximum:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 164, | |
"id": "e8e2f8a6", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"combined_threshold = 0.1\n", | |
"\n", | |
"connected_cosine[\"jaccard\"]= connected_cosine.apply(lambda row: list(nx.jaccard_coefficient(Gnx, [(row.source, row.target)]))[0][2], axis=1)\n", | |
"connected_cosine[\"combined_prediction\"]= connected_cosine.apply(lambda row: max(row.cosine,row.jaccard)>combined_threshold, axis=1)\n", | |
"\n", | |
"disconnected_cosine[\"jaccard\"]= disconnected_cosine.apply(lambda row: list(nx.jaccard_coefficient(Gnx, [(row.source, row.target)]))[0][2], axis=1)\n", | |
"disconnected_cosine[\"combined_prediction\"]= disconnected_cosine.apply(lambda row: max(row.cosine,row.jaccard)>combined_threshold, axis=1)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 165, | |
"id": "4aae7db3", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "", | |
"text/plain": [ | |
"<Figure size 640x480 with 2 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"confusion_01 = disconnected_cosine[\"combined_prediction\"].sum()\n", | |
"confusion_00 = sample_size - confusion_01\n", | |
"confusion_11 = connected_cosine[\"combined_prediction\"].sum()\n", | |
"confusion_10 = sample_size - confusion_11\n", | |
"array = np.array([[confusion_00, confusion_01],\n", | |
" [confusion_10, confusion_11]\n", | |
" ])*100/sample_size\n", | |
"\n", | |
"df_cm = pd.DataFrame(array, range(2), range(2))\n", | |
"# plt.figure(figsize=(10,7))\n", | |
"sn.set(font_scale=1.4) # for label size\n", | |
"sn.heatmap(df_cm, annot=True, annot_kws={\"size\": 16}) # font size\n", | |
"\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "dbdaa901", | |
"metadata": {}, | |
"source": [ | |
"This gives **an accuracy of 0.76** (= (78+74)/200). Altogether not a bad score for this simplistic approach." | |
] | |
}, | |
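The accuracy quoted above is the sum of the diagonal of the confusion matrix divided by the sum of all its entries. The sketch below spells this out. The helper name `accuracy` is an assumption, and so is the layout of the matrix: the text only gives the diagonal values 78 and 74, so their order and the off-diagonal values (inferred from each row summing to 100 percent) are illustrative.

```python
def accuracy(confusion):
    """Fraction of correct predictions: diagonal sum over the total."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Rows: actual class (0 = disconnected, 1 = connected); columns: predicted class.
# Values are percentages per sample; the off-diagonal entries are inferred.
confusion_percent = [
    [78, 22],
    [26, 74],
]

result = accuracy(confusion_percent)
print(result)
```

Because both samples have the same size, it makes no difference for the accuracy whether the matrix holds raw counts or percentages.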
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "99e5c3d2", | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3 (ipykernel)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.10.11" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} | |
, | |
"id": "dbeead4f", | |
"metadata": {}, | |
"source": [ | |
"# Cora\n", | |
"\n", | |
"A collection of Cora explorations. The focus is on machine learning and not visualization here. Jupyter notebooks are not adequate for visualization, see however the [yFiles Jupyter plugin](https://www.yworks.com/products/yfiles-graphs-for-jupyter).\n", | |
"\n", | |
"*Author*: Francois Vanderseypen, Orbifold Consulting (https://orbifold.net).<br>\n", | |
"*Article*: https://graphsandnetworks.com/the-cora-dataset<br>\n", | |
"*Last update*: July 2023.<br>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "f59e42e0", | |
"metadata": {}, | |
"source": [ | |
"## Download data\n", | |
"\n", | |
"This part is common to all packages, it downloads and unpacks the necessary data:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 58, | |
"id": "a06b6d68", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import os\n", | |
"import pandas as pd\n", | |
"data_dir = os.path.expanduser(\"~/cora\")\n", | |
"if not os.path.exists(data_dir):\n", | |
" os.makedirs(data_dir)\n", | |
"import requests\n", | |
"\n", | |
" \n", | |
"cora_tgz = os.path.join(data_dir, \"cora.tgz\")\n", | |
"response = requests.get(\"https://temprl.com/cora.tgz\", stream = True)\n", | |
"with open(cora_tgz,'wb') as output:\n", | |
" output.write(response.content)\n", | |
"\n", | |
"import tarfile\n", | |
"with tarfile.open(cora_tgz) as z:\n", | |
" for member in z:\n", | |
" if member.isdir():\n", | |
" continue\n", | |
" fname = member.name.rsplit('/',1)[1]\n", | |
" z.makefile(member,data_dir + '/' + fname)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "aa3525da", | |
"metadata": {}, | |
"source": [ | |
"## NetworkX" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "9bd8820a", | |
"metadata": {}, | |
"source": [ | |
"NetworkX is the most common graph package in Python. It does not perform any machine learning but it has a very complete graph analysis API and performs well on small and medium datasets." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 59, | |
"id": "60559b49", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import networkx as nx\n", | |
"\n", | |
"edge_data = pd.read_csv(os.path.join(data_dir, \"cora.cites\"), sep='\\t', header=None, names=[\"target\", \"source\"])\n", | |
"edge_data[\"label\"] = \"cites\"" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "70c0a3a1", | |
"metadata": {}, | |
"source": [ | |
"Each row of the edge list is just a source-target pair; the edges carry no payload:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 60, | |
"id": "ba16ed6c", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>target</th>\n", | |
" <th>source</th>\n", | |
" <th>label</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>741</th>\n", | |
" <td>3191</td>\n", | |
" <td>1127530</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4347</th>\n", | |
" <td>162080</td>\n", | |
" <td>1109830</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3549</th>\n", | |
" <td>69198</td>\n", | |
" <td>231198</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>209</th>\n", | |
" <td>114</td>\n", | |
" <td>91975</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4766</th>\n", | |
" <td>289085</td>\n", | |
" <td>689152</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" target source label\n", | |
"741 3191 1127530 cites\n", | |
"4347 162080 1109830 cites\n", | |
"3549 69198 231198 cites\n", | |
"209 114 91975 cites\n", | |
"4766 289085 689152 cites" | |
] | |
}, | |
"execution_count": 60, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"edge_data.sample(frac=1).head(5)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 61, | |
"id": "658349e2", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"Gnx = nx.from_pandas_edgelist(edge_data, edge_attr=\"label\")\n", | |
"nx.set_node_attributes(Gnx, \"paper\", \"label\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 62, | |
"id": "f3e560f3", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'label': 'paper'}" | |
] | |
}, | |
"execution_count": 62, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"Gnx.nodes[1103985]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 63, | |
"id": "a9cafdda", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"feature_names = [\"w_{}\".format(ii) for ii in range(1433)]\n", | |
"column_names = feature_names + [\"subject\"]\n", | |
"node_data = pd.read_csv(os.path.join(data_dir, \"cora.content\"), sep='\\t', header=None, names=column_names)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "42c07961", | |
"metadata": {}, | |
"source": [ | |
"The payload of each node consists of 1433 binary word indicators (`w_0` to `w_1432`) together with the subject label:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 64, | |
"id": "c27ee8a9", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>w_0</th>\n", | |
" <th>w_1</th>\n", | |
" <th>w_2</th>\n", | |
" <th>w_3</th>\n", | |
" <th>w_4</th>\n", | |
" <th>w_5</th>\n", | |
" <th>w_6</th>\n", | |
" <th>w_7</th>\n", | |
" <th>w_8</th>\n", | |
" <th>w_9</th>\n", | |
" <th>...</th>\n", | |
" <th>w_1424</th>\n", | |
" <th>w_1425</th>\n", | |
" <th>w_1426</th>\n", | |
" <th>w_1427</th>\n", | |
" <th>w_1428</th>\n", | |
" <th>w_1429</th>\n", | |
" <th>w_1430</th>\n", | |
" <th>w_1431</th>\n", | |
" <th>w_1432</th>\n", | |
" <th>subject</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>31336</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Neural_Networks</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1061127</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Rule_Learning</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1106406</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Reinforcement_Learning</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>13195</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Reinforcement_Learning</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>37879</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Probabilistic_Methods</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>5 rows × 1434 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" w_0 w_1 w_2 w_3 w_4 w_5 w_6 w_7 w_8 w_9 ... w_1424 \\\n", | |
"31336 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"1061127 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"1106406 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"13195 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"37879 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"\n", | |
" w_1425 w_1426 w_1427 w_1428 w_1429 w_1430 w_1431 w_1432 \\\n", | |
"31336 0 1 0 0 0 0 0 0 \n", | |
"1061127 1 0 0 0 0 0 0 0 \n", | |
"1106406 0 0 0 0 0 0 0 0 \n", | |
"13195 0 0 0 0 0 0 0 0 \n", | |
"37879 0 0 0 0 0 0 0 0 \n", | |
"\n", | |
" subject \n", | |
"31336 Neural_Networks \n", | |
"1061127 Rule_Learning \n", | |
"1106406 Reinforcement_Learning \n", | |
"13195 Reinforcement_Learning \n", | |
"37879 Probabilistic_Methods \n", | |
"\n", | |
"[5 rows x 1434 columns]" | |
] | |
}, | |
"execution_count": 64, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"node_data.head(5)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "61af1435", | |
"metadata": {}, | |
"source": [ | |
"There are seven subjects:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 65, | |
"id": "5f405208", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'Case_Based',\n", | |
" 'Genetic_Algorithms',\n", | |
" 'Neural_Networks',\n", | |
" 'Probabilistic_Methods',\n", | |
" 'Reinforcement_Learning',\n", | |
" 'Rule_Learning',\n", | |
" 'Theory'}" | |
] | |
}, | |
"execution_count": 65, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"set(node_data[\"subject\"])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "098bc658", | |
"metadata": {}, | |
"source": [ | |
"If you prefer not to keep the weights in separate columns, you can merge them into a single column:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 66, | |
"id": "5ff5a8b3", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"weight_column_names = node_data.columns[0:-1]\n", | |
"node_data['content'] = node_data[weight_column_names].apply(\n", | |
" lambda x: ','.join(x.dropna().astype(str)),\n", | |
" axis=1\n", | |
")\n", | |
"node_data.drop(weight_column_names, axis=1, inplace=True)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 84, | |
"id": "870e3fae", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>subject</th>\n", | |
" <th>content</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>31336</th>\n", | |
" <td>Neural_Networks</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1061127</th>\n", | |
" <td>Rule_Learning</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1106406</th>\n", | |
" <td>Reinforcement_Learning</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>13195</th>\n", | |
" <td>Reinforcement_Learning</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>37879</th>\n", | |
" <td>Probabilistic_Methods</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" subject \\\n", | |
"31336 Neural_Networks \n", | |
"1061127 Rule_Learning \n", | |
"1106406 Reinforcement_Learning \n", | |
"13195 Reinforcement_Learning \n", | |
"37879 Probabilistic_Methods \n", | |
"\n", | |
" content \n", | |
"31336 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... \n", | |
"1061127 0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,... \n", | |
"1106406 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... \n", | |
"13195 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... \n", | |
"37879 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... " | |
] | |
}, | |
"execution_count": 84, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"node_data.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "a1e65089", | |
"metadata": {}, | |
"source": [ | |
"Note that the content is not a learned embedding but a bag-of-words encoding of the article: each entry flags whether a given dictionary word occurs in the paper." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 87, | |
"id": "86e48586", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"\n", | |
"# Turn the comma-separated string back into a numeric vector.\n", | |
"node_data['content'] = node_data['content'].apply(lambda x: np.array([int(i) for i in x.split(',')]))" | |
] | |
}, | |
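To make the encoding concrete: each position in the content vector stands for one dictionary word, and a 1 means that word occurs in the paper. The sketch below shows the idea, including the round trip through a comma-separated string. The five-word vocabulary, the sample sentence and the helper names are invented for illustration.

```python
def encode(words, vocabulary):
    """Binary bag-of-words: 1 if the vocabulary word occurs in the document, else 0."""
    present = set(words)
    return [1 if term in present else 0 for term in vocabulary]


def to_string(vector):
    """Merge the indicator values into one comma-separated string."""
    return ",".join(str(x) for x in vector)


def from_string(content):
    """Split the comma-separated string back into a list of integers."""
    return [int(x) for x in content.split(",")]


vocabulary = ["graph", "neural", "genetic", "rule", "bayes"]  # hypothetical words
paper = "a neural model on a graph".split()

vector = encode(paper, vocabulary)
content = to_string(vector)
restored = from_string(content)
```

The string form is convenient for display or export, while the numeric form is what similarity measures need.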
{ | |
"cell_type": "markdown", | |
"id": "30af6bfb", | |
"metadata": {}, | |
"source": [ | |
"### Poor man's path to link prediction: Jaccard" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "ae086f04", | |
"metadata": {}, | |
"source": [ | |
"Long before graph machine learning came along, people were predicting edges using very simple algorithms. The Jaccard index measures how much the immediate neighborhoods of two nodes overlap: the number of shared neighbors divided by the number of nodes that neighbor at least one of the two. The more the neighborhoods overlap, the more likely the two nodes are to be connected. The idea stems from social network analysis: the more friends you share with somebody, the more likely it is that you know each other." | |
] | |
}, | |
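In formula form, the Jaccard score of two nodes $u$ and $v$ with neighbor sets $N(u)$ and $N(v)$ is $|N(u) \cap N(v)| / |N(u) \cup N(v)|$. A minimal sketch on a toy friendship graph follows, using plain Python sets instead of NetworkX. The graph and the names are invented for illustration.

```python
def jaccard(adjacency, u, v):
    """Jaccard score of u and v: shared neighbors divided by all neighbors of either node."""
    union = adjacency[u] | adjacency[v]
    if not union:
        return 0.0  # two isolated nodes: define the score as zero
    return len(adjacency[u] & adjacency[v]) / len(union)


# Undirected toy graph stored as node -> set of neighbors.
friends = {
    "ann": {"bob", "cid", "dee"},
    "bob": {"ann", "cid"},
    "cid": {"ann", "bob", "dee"},
    "dee": {"ann", "cid"},
    "eve": set(),
}

score_bob_dee = jaccard(friends, "bob", "dee")  # identical neighborhoods
score_ann_cid = jaccard(friends, "ann", "cid")  # two shared out of four
score_ann_eve = jaccard(friends, "ann", "eve")  # nothing in common
```

Here `bob` and `dee` are not connected but have exactly the same friends, so they are the strongest candidate for a new link.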
{ | |
"cell_type": "markdown", | |
"id": "fecc7d71", | |
"metadata": {}, | |
"source": [ | |
"The following is a manual calculation for a few existing edges:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "6c6e78ac", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"for u,v in list(Gnx.edges)[12:20]:\n", | |
" cnbors = list(nx.common_neighbors(Gnx, u, v))\n", | |
" union_size = len(set(Gnx[u]) | set(Gnx[v])) \n", | |
" print(u,v, len(cnbors)/union_size)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "19be9453", | |
"metadata": {}, | |
"source": [ | |
"Using NetworkX you can score all unconnected node pairs of the graph in one go:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "60c3f13d", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"jaccard_predictions = list(nx.jaccard_coefficient(Gnx))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "e1a7c6e6", | |
"metadata": {}, | |
"source": [ | |
"Keeping only the most likely candidates (score above 0.8):" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "0453540a", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"jaccard_predictions_top = [(t[0],t[1]) for t in jaccard_predictions if t[2]>0.8]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "ce922f59", | |
"metadata": {}, | |
"source": [ | |
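"Note that none of these are existing edges in the graph, since `jaccard_coefficient` without an explicit list of pairs only scores node pairs that are not yet connected:" | |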
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "de9a9463", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"[t for t in jaccard_predictions_top if Gnx.has_edge(*t)]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "2e7e5486", | |
"metadata": {}, | |
"source": [ | |
"Plenty of node pairs share exactly the same neighborhood, which gives a Jaccard score of one. The only pair above the threshold with a partially overlapping neighborhood is the following:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "e2337f85", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"[t for t in jaccard_predictions if t[2]>0.8 and t[2]!=1]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "d3810102", | |
"metadata": {}, | |
"source": [ | |
"You can see that they differ in a single node:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "479f1254", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"print(\"Common: \",sorted(nx.common_neighbors(Gnx, 14428, 14430)))\n", | |
"print(\"14428:\", list(nx.neighbors(Gnx,14428)))\n", | |
"print(\"14430:\", list(nx.neighbors(Gnx,14430)))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "625686cd", | |
"metadata": {}, | |
"source": [ | |
"The main problem with Jaccard is that it ignores the payload and only looks at the immediate topology. Even topologically it only considers the first hop, while neighborhoods further out may also have a lot in common.\n", | |
"This makes Jaccard indicative rather than reliable." | |
] | |
}, | |
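One way to look beyond the first hop is to compare the sets of nodes reachable within k steps instead of the direct neighbors. The sketch below is an illustrative variation, not something the notebook uses or a NetworkX function. The names `k_hop_neighborhood` and `k_hop_jaccard` and the toy path graph are assumptions made for the example.

```python
from collections import deque


def k_hop_neighborhood(adjacency, start, k):
    """All nodes reachable from start in at most k steps, excluding start itself."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == k:
            continue  # do not expand beyond k steps
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    seen.discard(start)
    return seen


def k_hop_jaccard(adjacency, u, v, k):
    """Jaccard score computed on k-hop neighborhoods (u and v themselves are left out)."""
    nu = k_hop_neighborhood(adjacency, u, k) - {v}
    nv = k_hop_neighborhood(adjacency, v, k) - {u}
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0


# Path graph a - b - c - d - e: a and e share no direct neighbor.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}

one_hop = k_hop_jaccard(path, "a", "e", 1)
two_hop = k_hop_jaccard(path, "a", "e", 2)
```

With k = 1 this is the ordinary Jaccard score and the two end nodes look unrelated. With k = 2 they share the middle node, so the score becomes one third.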
{ | |
"cell_type": "markdown", | |
"id": "d6689c9a", | |
"metadata": {}, | |
"source": [ | |
"### Cosine similarity of the payload\n", | |
"\n", | |
"We can also look at the payload alone and check whether the existing links are correlated with it.\n", | |
"The content of each node can be seen as a vector, so we sample connected and disconnected node pairs and compare their cosine similarity:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 108, | |
"id": "d78ca209", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from random import sample\n", | |
"import numpy as np\n", | |
"sample_size = 1000\n", | |
"connected_sample=sample(list(Gnx.edges), sample_size)\n", | |
"disconnected_sample=sample(list(nx.complement(Gnx).edges), sample_size)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 100, | |
"id": "b3c4df0e", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"assert len([e for e in disconnected_sample if Gnx.has_edge(*e)])==0" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "37385f91", | |
"metadata": {}, | |
"source": [ | |
"The cosine similarity is simply the dot product of the normalized vectors:" | |
] | |
}, | |
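Written out, $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. For binary vectors such as the Cora content this reduces to the number of shared words divided by the geometric mean of the two word counts. A small worked sketch in plain Python follows; the two vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


paper_a = [1, 1, 0, 1, 0, 0]  # words 0, 1 and 3 occur
paper_b = [1, 0, 0, 1, 1, 1]  # words 0, 3, 4 and 5 occur

similarity = cosine_similarity(paper_a, paper_b)

# Same value from set sizes: shared words / sqrt(words in a * words in b).
shared = sum(1 for x, y in zip(paper_a, paper_b) if x and y)
by_counts = shared / math.sqrt(sum(paper_a) * sum(paper_b))
```

Both papers share two words out of three and four respectively, so the similarity is 2 / sqrt(12), roughly 0.58.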
{ | |
"cell_type": "code", | |
"execution_count": 101, | |
"id": "3cbc816b", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def cosine(u,v):\n", | |
" uc = node_data.loc[u][\"content\"]\n", | |
" vc = node_data.loc[v][\"content\"]\n", | |
" return np.dot(uc, vc)/(np.linalg.norm(uc)*np.linalg.norm(vc))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "32f006e5", | |
"metadata": {}, | |
"source": [ | |
"So, for the connected subset we get:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 109, | |
"id": "2033b5e2", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"connected_cosine = pd.DataFrame({\"source\":[e[0] for e in connected_sample],\"target\":[e[1] for e in connected_sample]})\n", | |
"connected_cosine[\"cosine\"]= connected_cosine.apply(lambda row: cosine(row.source,row.target), axis=1)\n", | |
"# connected_cosine" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 110, | |
"id": "c5918c3c", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<Axes: >" | |
] | |
}, | |
"execution_count": 110, | |
"metadata": {}, | |
"output_type": "execute_result" | |
}, | |
{ | |
"data": { | |
"image/png": "", | |
"text/plain": [ | |
"<Figure size 640x480 with 1 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"connected_cosine.cosine.plot()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "576cfe84", | |
"metadata": {}, | |
"source": [ | |
"The average cosine similarity of the connected pairs is:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 111, | |
"id": "aa5792b9", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.16600344225511338" | |
] | |
}, | |
"execution_count": 111, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"np.average(connected_cosine.cosine)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 112, | |
"id": "e502eb5b", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"disconnected_cosine = pd.DataFrame({\"source\":[e[0] for e in disconnected_sample],\"target\":[e[1] for e in disconnected_sample]})\n", | |
"disconnected_cosine[\"cosine\"]= disconnected_cosine.apply(lambda row: cosine(row.source,row.target), axis=1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "f0168d1e", | |
"metadata": {}, | |
"source": [ | |
"Visual inspection reveals that the disconnected nodes have, on average, a lower cosine similarity:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 113, | |
"id": "ef51d9df", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<Axes: >" | |
] | |
}, | |
"execution_count": 113, | |
"metadata": {}, | |
"output_type": "execute_result" | |
}, | |
{ | |
"data": { | |
"image/png": "", | |
"text/plain": [ | |
"<Figure size 640x480 with 1 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"disconnected_cosine.cosine.plot()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "d094178e", | |
"metadata": {}, | |
"source": [ | |
"On average the cosine similarity is almost three times higher for connected nodes:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 115, | |
"id": "7e8fca56", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"2.8722134876583296" | |
] | |
}, | |
"execution_count": 115, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"np.average(connected_cosine.cosine)/np.average(disconnected_cosine.cosine)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "10c0293c", | |
"metadata": {}, | |
"source": [ | |
"This shows that connectivity is correlated with the payload of the nodes, so the payload can help to predict links." | |
] | |
}, | |
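A ratio of averages says little about how well the two groups actually separate. One simple way to quantify that is the probability that a randomly chosen connected pair has a higher cosine than a randomly chosen disconnected pair, which is the area under the ROC curve. The sketch below uses invented scores, not the notebook's results, and the function name `pairwise_auc` is an assumption.

```python
def pairwise_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Invented cosine values for illustration only.
connected = [0.30, 0.10, 0.25, 0.00]
disconnected = [0.05, 0.00, 0.10, 0.00]

auc = pairwise_auc(connected, disconnected)
```

A value of 0.5 would mean the cosine carries no information about connectivity, and 1.0 would mean perfect separation. Under the same assumptions, the two sampled `cosine` columns from the notebook could be passed in as lists.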
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "828b5706", | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3 (ipykernel)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.10.11" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} | |
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"id": "dbeead4f", | |
"metadata": {}, | |
"source": [ | |
"# Cora\n", | |
"\n", | |
"A collection of Cora explorations. The focus is on machine learning and not visualization here. Jupyter notebooks are not adequate for visualization, see however the [yFiles Jupyter plugin](https://www.yworks.com/products/yfiles-graphs-for-jupyter).\n", | |
"\n", | |
"*Author*: Francois Vanderseypen, Orbifold Consulting (https://orbifold.net).<br>\n", | |
"*Article*: https://graphsandnetworks.com/the-cora-dataset<br>\n", | |
"*Last update*: July 2023.<br>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "f59e42e0", | |
"metadata": {}, | |
"source": [ | |
"## Download data\n", | |
"\n", | |
"This part is common to all packages; it downloads and unpacks the necessary data:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"id": "a06b6d68", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import os\n", | |
"import tarfile\n", | |
"\n", | |
"import pandas as pd\n", | |
"import requests\n", | |
"\n", | |
"data_dir = os.path.expanduser(\"~/cora\")\n", | |
"os.makedirs(data_dir, exist_ok=True)\n", | |
"\n", | |
"# Download the archive and fail early on an HTTP error.\n", | |
"cora_tgz = os.path.join(data_dir, \"cora.tgz\")\n", | |
"response = requests.get(\"https://temprl.com/cora.tgz\", timeout=60)\n", | |
"response.raise_for_status()\n", | |
"with open(cora_tgz, \"wb\") as output:\n", | |
"    output.write(response.content)\n", | |
"\n", | |
"# Unpack every regular file directly into data_dir, dropping the archive's folders.\n", | |
"with tarfile.open(cora_tgz) as z:\n", | |
"    for member in z:\n", | |
"        if not member.isfile():\n", | |
"            continue\n", | |
"        fname = os.path.basename(member.name)\n", | |
"        with z.extractfile(member) as src, open(os.path.join(data_dir, fname), \"wb\") as dst:\n", | |
"            dst.write(src.read())" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "aa3525da", | |
"metadata": {}, | |
"source": [ | |
"## NetworkX" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "9bd8820a", | |
"metadata": {}, | |
"source": [ | |
"NetworkX is the most common graph package in Python. It does not perform any machine learning, but it has a comprehensive graph analysis API and performs well on small and medium-sized datasets." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"id": "60559b49", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import networkx as nx\n", | |
"\n", | |
"edge_data = pd.read_csv(os.path.join(data_dir, \"cora.cites\"), sep='\\t', header=None, names=[\"target\", \"source\"])\n", | |
"edge_data[\"label\"] = \"cites\"" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "70c0a3a1", | |
"metadata": {}, | |
"source": [ | |
"The edge list is just a list of source-target pairs; the edges carry no payload:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"id": "ba16ed6c", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>target</th>\n", | |
" <th>source</th>\n", | |
" <th>label</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>2671</th>\n", | |
" <td>28957</td>\n", | |
" <td>35922</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3334</th>\n", | |
" <td>56115</td>\n", | |
" <td>135130</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1696</th>\n", | |
" <td>10183</td>\n", | |
" <td>259772</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1701</th>\n", | |
" <td>10430</td>\n", | |
" <td>1120713</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5040</th>\n", | |
" <td>578646</td>\n", | |
" <td>1153900</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" target source label\n", | |
"2671 28957 35922 cites\n", | |
"3334 56115 135130 cites\n", | |
"1696 10183 259772 cites\n", | |
"1701 10430 1120713 cites\n", | |
"5040 578646 1153900 cites" | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"edge_data.sample(frac=1).head(5)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"id": "658349e2", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"Gnx = nx.from_pandas_edgelist(edge_data, edge_attr=\"label\")\n", | |
"nx.set_node_attributes(Gnx, \"paper\", \"label\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"id": "f3e560f3", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'label': 'paper'}" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
" Gnx.nodes[1103985]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"id": "a9cafdda", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"feature_names = [\"w_{}\".format(ii) for ii in range(1433)]\n", | |
"column_names = feature_names + [\"subject\"]\n", | |
"node_data = pd.read_csv(os.path.join(data_dir, \"cora.content\"), sep='\\t', header=None, names=column_names)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "42c07961", | |
"metadata": {}, | |
"source": [ | |
"The payload of each node consists of 1433 weights plus the subject label:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"id": "c27ee8a9", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>w_0</th>\n", | |
" <th>w_1</th>\n", | |
" <th>w_2</th>\n", | |
" <th>w_3</th>\n", | |
" <th>w_4</th>\n", | |
" <th>w_5</th>\n", | |
" <th>w_6</th>\n", | |
" <th>w_7</th>\n", | |
" <th>w_8</th>\n", | |
" <th>w_9</th>\n", | |
" <th>...</th>\n", | |
" <th>w_1424</th>\n", | |
" <th>w_1425</th>\n", | |
" <th>w_1426</th>\n", | |
" <th>w_1427</th>\n", | |
" <th>w_1428</th>\n", | |
" <th>w_1429</th>\n", | |
" <th>w_1430</th>\n", | |
" <th>w_1431</th>\n", | |
" <th>w_1432</th>\n", | |
" <th>subject</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>31336</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Neural_Networks</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1061127</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Rule_Learning</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1106406</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Reinforcement_Learning</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>13195</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Reinforcement_Learning</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>37879</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Probabilistic_Methods</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>5 rows × 1434 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" w_0 w_1 w_2 w_3 w_4 w_5 w_6 w_7 w_8 w_9 ... w_1424 \\\n", | |
"31336 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"1061127 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"1106406 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"13195 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"37879 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"\n", | |
" w_1425 w_1426 w_1427 w_1428 w_1429 w_1430 w_1431 w_1432 \\\n", | |
"31336 0 1 0 0 0 0 0 0 \n", | |
"1061127 1 0 0 0 0 0 0 0 \n", | |
"1106406 0 0 0 0 0 0 0 0 \n", | |
"13195 0 0 0 0 0 0 0 0 \n", | |
"37879 0 0 0 0 0 0 0 0 \n", | |
"\n", | |
" subject \n", | |
"31336 Neural_Networks \n", | |
"1061127 Rule_Learning \n", | |
"1106406 Reinforcement_Learning \n", | |
"13195 Reinforcement_Learning \n", | |
"37879 Probabilistic_Methods \n", | |
"\n", | |
"[5 rows x 1434 columns]" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"node_data.head(5)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "61af1435", | |
"metadata": {}, | |
"source": [ | |
"There are seven subjects:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 45, | |
"id": "5f405208", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'Case_Based',\n", | |
" 'Genetic_Algorithms',\n", | |
" 'Neural_Networks',\n", | |
" 'Probabilistic_Methods',\n", | |
" 'Reinforcement_Learning',\n", | |
" 'Rule_Learning',\n", | |
" 'Theory'}" | |
] | |
}, | |
"execution_count": 45, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"set(node_data[\"subject\"])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "098bc658", | |
"metadata": {}, | |
"source": [ | |
"If you prefer not to have the weights spread over many columns, you can merge them into a single column:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"id": "5ff5a8b3", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"weight_column_names = node_data.columns[0:-1]\n", | |
"node_data['content'] = node_data[weight_column_names].apply(\n", | |
" lambda x: ','.join(x.dropna().astype(str)),\n", | |
" axis=1\n", | |
")\n", | |
"node_data.drop(weight_column_names, axis=1, inplace=True)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"id": "870e3fae", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>subject</th>\n", | |
" <th>content</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>31336</th>\n", | |
" <td>Neural_Networks</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1061127</th>\n", | |
" <td>Rule_Learning</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1106406</th>\n", | |
" <td>Reinforcement_Learning</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>13195</th>\n", | |
" <td>Reinforcement_Learning</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>37879</th>\n", | |
" <td>Probabilistic_Methods</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" subject \\\n", | |
"31336 Neural_Networks \n", | |
"1061127 Rule_Learning \n", | |
"1106406 Reinforcement_Learning \n", | |
"13195 Reinforcement_Learning \n", | |
"37879 Probabilistic_Methods \n", | |
"\n", | |
" content \n", | |
"31336 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... \n", | |
"1061127 0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,... \n", | |
"1106406 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... \n", | |
"13195 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... \n", | |
"37879 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... " | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"node_data.head()" | |
] | |
}, | |
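{ | |
"cell_type": "markdown", | |
"id": "a1e65089", | |
"metadata": {}, | |
"source": [ | |
"Note that the content is not a learned embedding; it is the article content encoded as a binary vector in which each entry indicates whether a given dictionary word occurs in the paper." | |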
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "30af6bfb", | |
"metadata": {}, | |
"source": [ | |
"### Poor man's path to link prediction: Jaccard" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "ae086f04", | |
"metadata": {}, | |
"source": [ | |
"Long before graph machine learning came along, people were predicting edges using very simple algorithms. The Jaccard index measures how much the immediate neighborhoods of two nodes overlap: the number of shared neighbors divided by the number of nodes in the union of both neighborhoods. The more the neighborhoods overlap, the more likely the two nodes are to be connected. The idea stems from social network analysis: the more friends you share with somebody, the more likely it is that you know each other." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "fecc7d71", | |
"metadata": {}, | |
"source": [ | |
"The following is a manual calculation for a few edges:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"id": "6c6e78ac", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"35 1113438 0.005813953488372093\n", | |
"35 1113831 0.0058823529411764705\n", | |
"35 1114331 0.01764705882352941\n", | |
"35 1117476 0.0058823529411764705\n", | |
"35 1119505 0.0\n", | |
"35 1119708 0.01764705882352941\n", | |
"35 1120431 0.0\n", | |
"35 1123756 0.005847953216374269\n" | |
] | |
} | |
], | |
"source": [ | |
"for u, v in list(Gnx.edges)[12:20]:\n", | |
"    common = list(nx.common_neighbors(Gnx, u, v))\n", | |
"    union_size = len(set(Gnx[u]) | set(Gnx[v]))\n", | |
"    print(u, v, len(common) / union_size)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "19be9453", | |
"metadata": {}, | |
"source": [ | |
"With NetworkX you can score the whole graph in one go. By default, `nx.jaccard_coefficient` scores every pair of nodes that is not yet connected:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"id": "60c3f13d", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"jaccard_predictions = list(nx.jaccard_coefficient(Gnx))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "e1a7c6e6", | |
"metadata": {}, | |
"source": [ | |
"Keeping only the most likely candidates (score above 0.8):" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"id": "0453540a", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"jaccard_predictions_top = [(t[0],t[1]) for t in jaccard_predictions if t[2]>0.8]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "ce922f59", | |
"metadata": {}, | |
"source": [ | |
"Note that none of these candidates is an existing edge in the graph:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 104, | |
"id": "de9a9463", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[]" | |
] | |
}, | |
"execution_count": 104, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"[t for t in jaccard_predictions_top if Gnx.has_edge(*t)]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "777c6d2f", | |
"metadata": {}, | |
"source": [ | |
"Plenty of node pairs share exactly the same neighborhood, which gives a Jaccard score of one. Among the candidates above the 0.8 threshold, the only pair with a partially overlapping neighborhood is the following:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"id": "e2337f85", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[(14428, 14430, 0.8571428571428571)]" | |
] | |
}, | |
"execution_count": 30, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
" [t for t in jaccard_predictions if t[2]>0.8 and t[2]!=1]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "c59b74cc", | |
"metadata": {}, | |
"source": [ | |
"You can see that their neighborhoods differ in a single node (`1119216`), so six of the seven nodes in the union are shared:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 37, | |
"id": "d42a920a", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Common: [14429, 14431, 34082, 73119, 1103031, 1103969]\n", | |
"14428: [1103031, 1103969, 14429, 14431, 34082, 73119]\n", | |
"14430: [1103031, 1103969, 1119216, 14429, 14431, 34082, 73119]\n" | |
] | |
} | |
], | |
"source": [ | |
"print(\"Common: \",sorted(nx.common_neighbors(Gnx, 14428, 14430)))\n", | |
"print(\"14428:\", list(nx.neighbors(Gnx,14428)))\n", | |
"print(\"14430:\", list(nx.neighbors(Gnx,14430)))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "962f7bfb", | |
"metadata": {}, | |
"source": [ | |
"The main limitation of the Jaccard index is that it ignores the payload and looks only at the immediate topology. Even within the topology it considers only the first hop, although neighborhoods further out may also have a lot in common.\n", | |
"This makes Jaccard indicative rather than reliable." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "b36c1128", | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3 (ipykernel)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.10.11" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "221a982a", | |
"metadata": {}, | |
"source": [ | |
"## Download data\n", | |
"\n", | |
"This part is common to all packages; it downloads and unpacks the necessary data:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"id": "a06b6d68", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import os\n", | |
"import tarfile\n", | |
"\n", | |
"import pandas as pd\n", | |
"import requests\n", | |
"\n", | |
"data_dir = os.path.expanduser(\"~/cora\")\n", | |
"os.makedirs(data_dir, exist_ok=True)\n", | |
"\n", | |
"# Download the archive and fail early on an HTTP error.\n", | |
"cora_tgz = os.path.join(data_dir, \"cora.tgz\")\n", | |
"response = requests.get(\"https://temprl.com/cora.tgz\", timeout=60)\n", | |
"response.raise_for_status()\n", | |
"with open(cora_tgz, \"wb\") as output:\n", | |
"    output.write(response.content)\n", | |
"\n", | |
"# Unpack every regular file directly into data_dir, dropping the archive's folders.\n", | |
"with tarfile.open(cora_tgz) as z:\n", | |
"    for member in z:\n", | |
"        if not member.isfile():\n", | |
"            continue\n", | |
"        fname = os.path.basename(member.name)\n", | |
"        with z.extractfile(member) as src, open(os.path.join(data_dir, fname), \"wb\") as dst:\n", | |
"            dst.write(src.read())" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "99458e7d", | |
"metadata": {}, | |
"source": [ | |
"## NetworkX" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "dd2672ad", | |
"metadata": {}, | |
"source": [ | |
"NetworkX is the most common graph package in Python. It does not perform any machine learning, but it has a comprehensive graph analysis API and performs well on small and medium-sized datasets." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 39, | |
"id": "60559b49", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import networkx as nx\n", | |
"\n", | |
"edge_data = pd.read_csv(os.path.join(data_dir, \"cora.cites\"), sep='\\t', header=None, names=[\"target\", \"source\"])\n", | |
"edge_data[\"label\"] = \"cites\"" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "5de3c572", | |
"metadata": {}, | |
"source": [ | |
"The edge list is just a list of source-target pairs; the edges carry no payload:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 40, | |
"id": "8e58718d", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>target</th>\n", | |
" <th>source</th>\n", | |
" <th>label</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>4354</th>\n", | |
" <td>162664</td>\n", | |
" <td>531348</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>907</th>\n", | |
" <td>3232</td>\n", | |
" <td>20942</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4151</th>\n", | |
" <td>133553</td>\n", | |
" <td>1120049</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2607</th>\n", | |
" <td>28290</td>\n", | |
" <td>56709</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4943</th>\n", | |
" <td>523574</td>\n", | |
" <td>1154229</td>\n", | |
" <td>cites</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" target source label\n", | |
"4354 162664 531348 cites\n", | |
"907 3232 20942 cites\n", | |
"4151 133553 1120049 cites\n", | |
"2607 28290 56709 cites\n", | |
"4943 523574 1154229 cites" | |
] | |
}, | |
"execution_count": 40, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"edge_data.sample(frac=1).head(5)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 41, | |
"id": "2d12a50d", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"Gnx = nx.from_pandas_edgelist(edge_data, edge_attr=\"label\")\n", | |
"nx.set_node_attributes(Gnx, \"paper\", \"label\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 42, | |
"id": "33f2d088", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'label': 'paper'}" | |
] | |
}, | |
"execution_count": 42, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
" Gnx.nodes[1103985]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 43, | |
"id": "2e721d84", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"feature_names = [\"w_{}\".format(ii) for ii in range(1433)]\n", | |
"column_names = feature_names + [\"subject\"]\n", | |
"node_data = pd.read_csv(os.path.join(data_dir, \"cora.content\"), sep='\\t', header=None, names=column_names)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "89b1c2e1", | |
"metadata": {}, | |
"source": [ | |
"The payload of each node consists of 1433 weights plus the subject label:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 44, | |
"id": "7e9df542", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>w_0</th>\n", | |
" <th>w_1</th>\n", | |
" <th>w_2</th>\n", | |
" <th>w_3</th>\n", | |
" <th>w_4</th>\n", | |
" <th>w_5</th>\n", | |
" <th>w_6</th>\n", | |
" <th>w_7</th>\n", | |
" <th>w_8</th>\n", | |
" <th>w_9</th>\n", | |
" <th>...</th>\n", | |
" <th>w_1424</th>\n", | |
" <th>w_1425</th>\n", | |
" <th>w_1426</th>\n", | |
" <th>w_1427</th>\n", | |
" <th>w_1428</th>\n", | |
" <th>w_1429</th>\n", | |
" <th>w_1430</th>\n", | |
" <th>w_1431</th>\n", | |
" <th>w_1432</th>\n", | |
" <th>subject</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>31336</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Neural_Networks</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1061127</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Rule_Learning</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1106406</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Reinforcement_Learning</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>13195</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Reinforcement_Learning</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>37879</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Probabilistic_Methods</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>5 rows × 1434 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" w_0 w_1 w_2 w_3 w_4 w_5 w_6 w_7 w_8 w_9 ... w_1424 \\\n", | |
"31336 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"1061127 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"1106406 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"13195 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"37879 0 0 0 0 0 0 0 0 0 0 ... 0 \n", | |
"\n", | |
" w_1425 w_1426 w_1427 w_1428 w_1429 w_1430 w_1431 w_1432 \\\n", | |
"31336 0 1 0 0 0 0 0 0 \n", | |
"1061127 1 0 0 0 0 0 0 0 \n", | |
"1106406 0 0 0 0 0 0 0 0 \n", | |
"13195 0 0 0 0 0 0 0 0 \n", | |
"37879 0 0 0 0 0 0 0 0 \n", | |
"\n", | |
" subject \n", | |
"31336 Neural_Networks \n", | |
"1061127 Rule_Learning \n", | |
"1106406 Reinforcement_Learning \n", | |
"13195 Reinforcement_Learning \n", | |
"37879 Probabilistic_Methods \n", | |
"\n", | |
"[5 rows x 1434 columns]" | |
] | |
}, | |
"execution_count": 44, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"node_data.head(5)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "46c3f09d", | |
"metadata": {}, | |
"source": [ | |
"There are seven subjects:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 45, | |
"id": "72aedfcf", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'Case_Based',\n", | |
" 'Genetic_Algorithms',\n", | |
" 'Neural_Networks',\n", | |
" 'Probabilistic_Methods',\n", | |
" 'Reinforcement_Learning',\n", | |
" 'Rule_Learning',\n", | |
" 'Theory'}" | |
] | |
}, | |
"execution_count": 45, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"set(node_data[\"subject\"])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "803e0b6d", | |
"metadata": {}, | |
"source": [ | |
"If you prefer not to have the weights spread over many columns, you can merge them into a single column:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 46, | |
"id": "d287d7b9", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"weight_column_names = node_data.columns[0:-1]\n", | |
"node_data['content'] = node_data[weight_column_names].apply(\n", | |
" lambda x: ','.join(x.dropna().astype(str)),\n", | |
" axis=1\n", | |
")\n", | |
"node_data.drop(weight_column_names, axis=1, inplace=True)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 47, | |
"id": "611826dc", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>subject</th>\n", | |
" <th>content</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>31336</th>\n", | |
" <td>Neural_Networks</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1061127</th>\n", | |
" <td>Rule_Learning</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1106406</th>\n", | |
" <td>Reinforcement_Learning</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>13195</th>\n", | |
" <td>Reinforcement_Learning</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>37879</th>\n", | |
" <td>Probabilistic_Methods</td>\n", | |
" <td>0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" subject \\\n", | |
"31336 Neural_Networks \n", | |
"1061127 Rule_Learning \n", | |
"1106406 Reinforcement_Learning \n", | |
"13195 Reinforcement_Learning \n", | |
"37879 Probabilistic_Methods \n", | |
"\n", | |
" content \n", | |
"31336 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... \n", | |
"1061127 0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,... \n", | |
"1106406 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... \n", | |
"13195 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... \n", | |
"37879 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... " | |
] | |
}, | |
"execution_count": 47, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"node_data.head()" | |
] | |
}, | |
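{ | |
"cell_type": "markdown", | |
"id": "e6c9a46e", | |
"metadata": {}, | |
"source": [ | |
"Note that the content is not a learned embedding; it is the article content encoded as a binary vector in which each entry indicates whether a given dictionary word occurs in the paper." | |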
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "17f37893", | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3 (ipykernel)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.10.11" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} |
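
The Jaccard section of the notebook points out two limitations: only the first hop is compared, and the node payload is ignored. The following is a minimal sketch, in plain Python, of how both could be probed. It is not part of the original notebook: the toy graph, the word vectors and the helper names `jaccard` and `k_hop_neighbors` are assumptions made purely for illustration.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |a & b| / |a | b| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def k_hop_neighbors(adj, node, hops):
    """Nodes reachable from `node` in at most `hops` steps, excluding `node` itself."""
    seen = {node}
    frontier = {node}
    for _ in range(hops):
        # Expand one step and keep only nodes that were not visited before.
        frontier = {n for f in frontier for n in adj[f]} - seen
        seen |= frontier
    return seen - {node}


# Hypothetical undirected toy graph (not taken from Cora), as an adjacency dict.
adj = {
    "a": {"c", "d"},
    "b": {"e", "f"},
    "c": {"a", "g"},
    "d": {"a"},
    "e": {"b", "g"},
    "f": {"b"},
    "g": {"c", "e"},
}

# First hop only: "a" and "b" share no neighbor.
one_hop = jaccard(adj["a"], adj["b"])

# Two hops: both nodes reach "g", so the neighborhoods now overlap.
two_hop = jaccard(k_hop_neighbors(adj, "a", 2), k_hop_neighbors(adj, "b", 2))

# Payload: Jaccard on the positions of the non-zero entries of two binary word vectors.
words_u = [1, 0, 1, 1, 0]
words_v = [1, 0, 0, 1, 1]
payload = jaccard(
    {i for i, w in enumerate(words_u) if w},
    {i for i, w in enumerate(words_v) if w},
)

print(one_hop, two_hop, payload)
```

On the Cora graph the same idea would use `Gnx[u]` for the adjacency and the binary weight columns for the payload. Whether either variant actually improves the predictions is not something this sketch establishes.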