Created
March 5, 2023 21:59
GoDaddy Microbusiness Density Forecasting Competition
{ | |
"cells": [ | |
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Solution" | |
] | |
}, | |
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This is my solution to [this](https://www.kaggle.com/competitions/godaddy-microbusiness-density-forecasting/) Kaggle competition. This is not a serious attempt at getting a high score. Instead, I developed a methodology for forecasting multiple steps ahead, using past predictions as features.\n", | |
"\n", | |
"Take any time series forecasting task. For the first step ahead, you can use the training data to build lagged features. But for the second step, you can't do that, because the ground truth isn't available. However, you could use the prediction obtained for the first step as a proxy. Then, for training, you need the model to be trained on the same data as for the first step, except that the previous value should be replaced by the out-of-fold prediction of the first step.\n", | |
"\n", | |
"This is quite tricky to get right. In this competiton, the length of each time series is the same. I thus stored the values for each step in columns. Every time I did predicted a step ahead, I replaced the past values accordingly. I'm not sure how to make this work by keeping the values stored in a single column. There's probably a tidy data trick to do this." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"\n", | |
"def smape(A, F):\n", | |
" return 100 / len(A) * np.sum(2 * np.abs(F - A) / (np.abs(A) + np.abs(F)))" | |
] | |
}, | |
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Data loading" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>cfips</th>\n", | |
" <th>county</th>\n", | |
" <th>state</th>\n", | |
" <th>first_day_of_month</th>\n", | |
" <th>microbusiness_density</th>\n", | |
" <th>active</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>row_id</th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>1001_2019-08-01</th>\n", | |
" <td>1001</td>\n", | |
" <td>Autauga County</td>\n", | |
" <td>Alabama</td>\n", | |
" <td>2019-08-01</td>\n", | |
" <td>3.007682</td>\n", | |
" <td>1249</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1001_2019-09-01</th>\n", | |
" <td>1001</td>\n", | |
" <td>Autauga County</td>\n", | |
" <td>Alabama</td>\n", | |
" <td>2019-09-01</td>\n", | |
" <td>2.884870</td>\n", | |
" <td>1198</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1001_2019-10-01</th>\n", | |
" <td>1001</td>\n", | |
" <td>Autauga County</td>\n", | |
" <td>Alabama</td>\n", | |
" <td>2019-10-01</td>\n", | |
" <td>3.055843</td>\n", | |
" <td>1269</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1001_2019-11-01</th>\n", | |
" <td>1001</td>\n", | |
" <td>Autauga County</td>\n", | |
" <td>Alabama</td>\n", | |
" <td>2019-11-01</td>\n", | |
" <td>2.993233</td>\n", | |
" <td>1243</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1001_2019-12-01</th>\n", | |
" <td>1001</td>\n", | |
" <td>Autauga County</td>\n", | |
" <td>Alabama</td>\n", | |
" <td>2019-12-01</td>\n", | |
" <td>2.993233</td>\n", | |
" <td>1243</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" cfips county state first_day_of_month \\\n", | |
"row_id \n", | |
"1001_2019-08-01 1001 Autauga County Alabama 2019-08-01 \n", | |
"1001_2019-09-01 1001 Autauga County Alabama 2019-09-01 \n", | |
"1001_2019-10-01 1001 Autauga County Alabama 2019-10-01 \n", | |
"1001_2019-11-01 1001 Autauga County Alabama 2019-11-01 \n", | |
"1001_2019-12-01 1001 Autauga County Alabama 2019-12-01 \n", | |
"\n", | |
" microbusiness_density active \n", | |
"row_id \n", | |
"1001_2019-08-01 3.007682 1249 \n", | |
"1001_2019-09-01 2.884870 1198 \n", | |
"1001_2019-10-01 3.055843 1269 \n", | |
"1001_2019-11-01 2.993233 1243 \n", | |
"1001_2019-12-01 2.993233 1243 " | |
] | |
}, | |
"execution_count": 2, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"import pandas as pd\n", | |
"\n", | |
"train = pd.read_csv('data/train.csv', index_col='row_id')\n", | |
"train.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>cfips</th>\n", | |
" <th>first_day_of_month</th>\n", | |
" <th>county</th>\n", | |
" <th>state</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>row_id</th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>1001_2022-11-01</th>\n", | |
" <td>1001</td>\n", | |
" <td>2022-11-01</td>\n", | |
" <td>Autauga County</td>\n", | |
" <td>Alabama</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1003_2022-11-01</th>\n", | |
" <td>1003</td>\n", | |
" <td>2022-11-01</td>\n", | |
" <td>Baldwin County</td>\n", | |
" <td>Alabama</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1005_2022-11-01</th>\n", | |
" <td>1005</td>\n", | |
" <td>2022-11-01</td>\n", | |
" <td>Barbour County</td>\n", | |
" <td>Alabama</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1007_2022-11-01</th>\n", | |
" <td>1007</td>\n", | |
" <td>2022-11-01</td>\n", | |
" <td>Bibb County</td>\n", | |
" <td>Alabama</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1009_2022-11-01</th>\n", | |
" <td>1009</td>\n", | |
" <td>2022-11-01</td>\n", | |
" <td>Blount County</td>\n", | |
" <td>Alabama</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" cfips first_day_of_month county state\n", | |
"row_id \n", | |
"1001_2022-11-01 1001 2022-11-01 Autauga County Alabama\n", | |
"1003_2022-11-01 1003 2022-11-01 Baldwin County Alabama\n", | |
"1005_2022-11-01 1005 2022-11-01 Barbour County Alabama\n", | |
"1007_2022-11-01 1007 2022-11-01 Bibb County Alabama\n", | |
"1009_2022-11-01 1009 2022-11-01 Blount County Alabama" | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"cfips_to_county = train.groupby('cfips')['county'].first()\n", | |
"cfips_to_state = train.groupby('cfips')['state'].first()\n", | |
"test = (\n", | |
" pd.read_csv('data/test.csv', index_col='row_id')\n", | |
" .assign(county=lambda df: df.cfips.map(cfips_to_county))\n", | |
" .assign(state=lambda df: df.cfips.map(cfips_to_state))\n", | |
")\n", | |
"test.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>cfips</th>\n", | |
" <th>county</th>\n", | |
" <th>state</th>\n", | |
" <th>month</th>\n", | |
" <th>target</th>\n", | |
" <th>active</th>\n", | |
" <th>is_train</th>\n", | |
" <th>lat</th>\n", | |
" <th>lng</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>row_id</th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>1001_2019-08-01</th>\n", | |
" <td>1001</td>\n", | |
" <td>Autauga County</td>\n", | |
" <td>Alabama</td>\n", | |
" <td>2019-08-01</td>\n", | |
" <td>3.007682</td>\n", | |
" <td>1249.0</td>\n", | |
" <td>True</td>\n", | |
" <td>32.535142</td>\n", | |
" <td>-86.6429</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1001_2019-09-01</th>\n", | |
" <td>1001</td>\n", | |
" <td>Autauga County</td>\n", | |
" <td>Alabama</td>\n", | |
" <td>2019-09-01</td>\n", | |
" <td>2.884870</td>\n", | |
" <td>1198.0</td>\n", | |
" <td>True</td>\n", | |
" <td>32.535142</td>\n", | |
" <td>-86.6429</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1001_2019-10-01</th>\n", | |
" <td>1001</td>\n", | |
" <td>Autauga County</td>\n", | |
" <td>Alabama</td>\n", | |
" <td>2019-10-01</td>\n", | |
" <td>3.055843</td>\n", | |
" <td>1269.0</td>\n", | |
" <td>True</td>\n", | |
" <td>32.535142</td>\n", | |
" <td>-86.6429</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1001_2019-11-01</th>\n", | |
" <td>1001</td>\n", | |
" <td>Autauga County</td>\n", | |
" <td>Alabama</td>\n", | |
" <td>2019-11-01</td>\n", | |
" <td>2.993233</td>\n", | |
" <td>1243.0</td>\n", | |
" <td>True</td>\n", | |
" <td>32.535142</td>\n", | |
" <td>-86.6429</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1001_2019-12-01</th>\n", | |
" <td>1001</td>\n", | |
" <td>Autauga County</td>\n", | |
" <td>Alabama</td>\n", | |
" <td>2019-12-01</td>\n", | |
" <td>2.993233</td>\n", | |
" <td>1243.0</td>\n", | |
" <td>True</td>\n", | |
" <td>32.535142</td>\n", | |
" <td>-86.6429</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" cfips county state month target active \\\n", | |
"row_id \n", | |
"1001_2019-08-01 1001 Autauga County Alabama 2019-08-01 3.007682 1249.0 \n", | |
"1001_2019-09-01 1001 Autauga County Alabama 2019-09-01 2.884870 1198.0 \n", | |
"1001_2019-10-01 1001 Autauga County Alabama 2019-10-01 3.055843 1269.0 \n", | |
"1001_2019-11-01 1001 Autauga County Alabama 2019-11-01 2.993233 1243.0 \n", | |
"1001_2019-12-01 1001 Autauga County Alabama 2019-12-01 2.993233 1243.0 \n", | |
"\n", | |
" is_train lat lng \n", | |
"row_id \n", | |
"1001_2019-08-01 True 32.535142 -86.6429 \n", | |
"1001_2019-09-01 True 32.535142 -86.6429 \n", | |
"1001_2019-10-01 True 32.535142 -86.6429 \n", | |
"1001_2019-11-01 True 32.535142 -86.6429 \n", | |
"1001_2019-12-01 True 32.535142 -86.6429 " | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"ds = pd.concat([train, test])\n", | |
"ds['cfips'] = ds['cfips'].astype('category')\n", | |
"ds['county'] = ds['county'].astype('category')\n", | |
"ds['state'] = ds['state'].astype('category')\n", | |
"ds['is_train'] = ds.microbusiness_density.notnull()\n", | |
"ds = ds.rename(columns={'first_day_of_month': 'month', 'microbusiness_density': 'target'})\n", | |
"ds.month = pd.to_datetime(ds.month)\n", | |
"locations = pd.read_csv('data/cfips_location.csv', index_col='cfips')\n", | |
"ds = ds.join(locations[['lat', 'lng']], on='cfips')\n", | |
"ds.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"application/vnd.plotly.v1+json": { | |
"config": { | |
"plotlyServerURL": "https://plot.ly" | |
}, | |
"data": [ | |
{ | |
"hovertemplate": "month=%{x}<br>target=%{y}<extra></extra>", | |
"legendgroup": "", | |
"line": { | |
"color": "#636efa", | |
"dash": "solid" | |
}, | |
"marker": { | |
"symbol": "circle" | |
}, | |
"mode": "lines", | |
"name": "", | |
"orientation": "v", | |
"showlegend": false, | |
"type": "scatter", | |
"x": [ | |
"2019-08-01T00:00:00", | |
"2019-09-01T00:00:00", | |
"2019-10-01T00:00:00", | |
"2019-11-01T00:00:00", | |
"2019-12-01T00:00:00", | |
"2020-01-01T00:00:00", | |
"2020-02-01T00:00:00", | |
"2020-03-01T00:00:00", | |
"2020-04-01T00:00:00", | |
"2020-05-01T00:00:00", | |
"2020-06-01T00:00:00", | |
"2020-07-01T00:00:00", | |
"2020-08-01T00:00:00", | |
"2020-09-01T00:00:00", | |
"2020-10-01T00:00:00", | |
"2020-11-01T00:00:00", | |
"2020-12-01T00:00:00", | |
"2021-01-01T00:00:00", | |
"2021-02-01T00:00:00", | |
"2021-03-01T00:00:00", | |
"2021-04-01T00:00:00", | |
"2021-05-01T00:00:00", | |
"2021-06-01T00:00:00", | |
"2021-07-01T00:00:00", | |
"2021-08-01T00:00:00", | |
"2021-09-01T00:00:00", | |
"2021-10-01T00:00:00", | |
"2021-11-01T00:00:00", | |
"2021-12-01T00:00:00", | |
"2022-01-01T00:00:00", | |
"2022-02-01T00:00:00", | |
"2022-03-01T00:00:00", | |
"2022-04-01T00:00:00", | |
"2022-05-01T00:00:00", | |
"2022-06-01T00:00:00", | |
"2022-07-01T00:00:00", | |
"2022-08-01T00:00:00", | |
"2022-09-01T00:00:00", | |
"2022-10-01T00:00:00", | |
"2022-11-01T00:00:00", | |
"2022-12-01T00:00:00", | |
"2023-01-01T00:00:00", | |
"2023-02-01T00:00:00", | |
"2023-03-01T00:00:00", | |
"2023-04-01T00:00:00", | |
"2023-05-01T00:00:00", | |
"2023-06-01T00:00:00" | |
], | |
"xaxis": "x", | |
"y": [ | |
12.555554, | |
12.50948, | |
12.535927, | |
12.550398, | |
12.491517, | |
12.376486, | |
12.067235, | |
12.139173, | |
12.192754, | |
12.171586, | |
12.184155, | |
12.261385, | |
12.288671, | |
12.253282, | |
12.22186, | |
12.155214, | |
12.149261, | |
11.946164, | |
11.066442, | |
11.124837, | |
11.142438, | |
11.099834, | |
10.978436, | |
10.986661, | |
10.988141, | |
10.952939, | |
10.972843, | |
11.387701, | |
11.44034, | |
11.462683, | |
11.434597, | |
11.529694, | |
11.532157, | |
11.43936, | |
11.544476, | |
11.664701, | |
11.622984, | |
11.615921, | |
11.625118, | |
null, | |
null, | |
null, | |
null, | |
null, | |
null, | |
null, | |
null | |
], | |
"yaxis": "y" | |
} | |
], | |
"layout": { | |
"legend": { | |
"tracegroupgap": 0 | |
}, | |
"template": { | |
"data": { | |
"bar": [ | |
{ | |
"error_x": { | |
"color": "#2a3f5f" | |
}, | |
"error_y": { | |
"color": "#2a3f5f" | |
}, | |
"marker": { | |
"line": { | |
"color": "#E5ECF6", | |
"width": 0.5 | |
}, | |
"pattern": { | |
"fillmode": "overlay", | |
"size": 10, | |
"solidity": 0.2 | |
} | |
}, | |
"type": "bar" | |
} | |
], | |
"barpolar": [ | |
{ | |
"marker": { | |
"line": { | |
"color": "#E5ECF6", | |
"width": 0.5 | |
}, | |
"pattern": { | |
"fillmode": "overlay", | |
"size": 10, | |
"solidity": 0.2 | |
} | |
}, | |
"type": "barpolar" | |
} | |
], | |
"carpet": [ | |
{ | |
"aaxis": { | |
"endlinecolor": "#2a3f5f", | |
"gridcolor": "white", | |
"linecolor": "white", | |
"minorgridcolor": "white", | |
"startlinecolor": "#2a3f5f" | |
}, | |
"baxis": { | |
"endlinecolor": "#2a3f5f", | |
"gridcolor": "white", | |
"linecolor": "white", | |
"minorgridcolor": "white", | |
"startlinecolor": "#2a3f5f" | |
}, | |
"type": "carpet" | |
} | |
], | |
"choropleth": [ | |
{ | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
}, | |
"type": "choropleth" | |
} | |
], | |
"contour": [ | |
{ | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
}, | |
"colorscale": [ | |
[ | |
0, | |
"#0d0887" | |
], | |
[ | |
0.1111111111111111, | |
"#46039f" | |
], | |
[ | |
0.2222222222222222, | |
"#7201a8" | |
], | |
[ | |
0.3333333333333333, | |
"#9c179e" | |
], | |
[ | |
0.4444444444444444, | |
"#bd3786" | |
], | |
[ | |
0.5555555555555556, | |
"#d8576b" | |
], | |
[ | |
0.6666666666666666, | |
"#ed7953" | |
], | |
[ | |
0.7777777777777778, | |
"#fb9f3a" | |
], | |
[ | |
0.8888888888888888, | |
"#fdca26" | |
], | |
[ | |
1, | |
"#f0f921" | |
] | |
], | |
"type": "contour" | |
} | |
], | |
"contourcarpet": [ | |
{ | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
}, | |
"type": "contourcarpet" | |
} | |
], | |
"heatmap": [ | |
{ | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
}, | |
"colorscale": [ | |
[ | |
0, | |
"#0d0887" | |
], | |
[ | |
0.1111111111111111, | |
"#46039f" | |
], | |
[ | |
0.2222222222222222, | |
"#7201a8" | |
], | |
[ | |
0.3333333333333333, | |
"#9c179e" | |
], | |
[ | |
0.4444444444444444, | |
"#bd3786" | |
], | |
[ | |
0.5555555555555556, | |
"#d8576b" | |
], | |
[ | |
0.6666666666666666, | |
"#ed7953" | |
], | |
[ | |
0.7777777777777778, | |
"#fb9f3a" | |
], | |
[ | |
0.8888888888888888, | |
"#fdca26" | |
], | |
[ | |
1, | |
"#f0f921" | |
] | |
], | |
"type": "heatmap" | |
} | |
], | |
"heatmapgl": [ | |
{ | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
}, | |
"colorscale": [ | |
[ | |
0, | |
"#0d0887" | |
], | |
[ | |
0.1111111111111111, | |
"#46039f" | |
], | |
[ | |
0.2222222222222222, | |
"#7201a8" | |
], | |
[ | |
0.3333333333333333, | |
"#9c179e" | |
], | |
[ | |
0.4444444444444444, | |
"#bd3786" | |
], | |
[ | |
0.5555555555555556, | |
"#d8576b" | |
], | |
[ | |
0.6666666666666666, | |
"#ed7953" | |
], | |
[ | |
0.7777777777777778, | |
"#fb9f3a" | |
], | |
[ | |
0.8888888888888888, | |
"#fdca26" | |
], | |
[ | |
1, | |
"#f0f921" | |
] | |
], | |
"type": "heatmapgl" | |
} | |
], | |
"histogram": [ | |
{ | |
"marker": { | |
"pattern": { | |
"fillmode": "overlay", | |
"size": 10, | |
"solidity": 0.2 | |
} | |
}, | |
"type": "histogram" | |
} | |
], | |
"histogram2d": [ | |
{ | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
}, | |
"colorscale": [ | |
[ | |
0, | |
"#0d0887" | |
], | |
[ | |
0.1111111111111111, | |
"#46039f" | |
], | |
[ | |
0.2222222222222222, | |
"#7201a8" | |
], | |
[ | |
0.3333333333333333, | |
"#9c179e" | |
], | |
[ | |
0.4444444444444444, | |
"#bd3786" | |
], | |
[ | |
0.5555555555555556, | |
"#d8576b" | |
], | |
[ | |
0.6666666666666666, | |
"#ed7953" | |
], | |
[ | |
0.7777777777777778, | |
"#fb9f3a" | |
], | |
[ | |
0.8888888888888888, | |
"#fdca26" | |
], | |
[ | |
1, | |
"#f0f921" | |
] | |
], | |
"type": "histogram2d" | |
} | |
], | |
"histogram2dcontour": [ | |
{ | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
}, | |
"colorscale": [ | |
[ | |
0, | |
"#0d0887" | |
], | |
[ | |
0.1111111111111111, | |
"#46039f" | |
], | |
[ | |
0.2222222222222222, | |
"#7201a8" | |
], | |
[ | |
0.3333333333333333, | |
"#9c179e" | |
], | |
[ | |
0.4444444444444444, | |
"#bd3786" | |
], | |
[ | |
0.5555555555555556, | |
"#d8576b" | |
], | |
[ | |
0.6666666666666666, | |
"#ed7953" | |
], | |
[ | |
0.7777777777777778, | |
"#fb9f3a" | |
], | |
[ | |
0.8888888888888888, | |
"#fdca26" | |
], | |
[ | |
1, | |
"#f0f921" | |
] | |
], | |
"type": "histogram2dcontour" | |
} | |
], | |
"mesh3d": [ | |
{ | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
}, | |
"type": "mesh3d" | |
} | |
], | |
"parcoords": [ | |
{ | |
"line": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"type": "parcoords" | |
} | |
], | |
"pie": [ | |
{ | |
"automargin": true, | |
"type": "pie" | |
} | |
], | |
"scatter": [ | |
{ | |
"fillpattern": { | |
"fillmode": "overlay", | |
"size": 10, | |
"solidity": 0.2 | |
}, | |
"type": "scatter" | |
} | |
], | |
"scatter3d": [ | |
{ | |
"line": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"marker": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"type": "scatter3d" | |
} | |
], | |
"scattercarpet": [ | |
{ | |
"marker": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"type": "scattercarpet" | |
} | |
], | |
"scattergeo": [ | |
{ | |
"marker": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"type": "scattergeo" | |
} | |
], | |
"scattergl": [ | |
{ | |
"marker": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"type": "scattergl" | |
} | |
], | |
"scattermapbox": [ | |
{ | |
"marker": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"type": "scattermapbox" | |
} | |
], | |
"scatterpolar": [ | |
{ | |
"marker": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"type": "scatterpolar" | |
} | |
], | |
"scatterpolargl": [ | |
{ | |
"marker": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"type": "scatterpolargl" | |
} | |
], | |
"scatterternary": [ | |
{ | |
"marker": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"type": "scatterternary" | |
} | |
], | |
"surface": [ | |
{ | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
}, | |
"colorscale": [ | |
[ | |
0, | |
"#0d0887" | |
], | |
[ | |
0.1111111111111111, | |
"#46039f" | |
], | |
[ | |
0.2222222222222222, | |
"#7201a8" | |
], | |
[ | |
0.3333333333333333, | |
"#9c179e" | |
], | |
[ | |
0.4444444444444444, | |
"#bd3786" | |
], | |
[ | |
0.5555555555555556, | |
"#d8576b" | |
], | |
[ | |
0.6666666666666666, | |
"#ed7953" | |
], | |
[ | |
0.7777777777777778, | |
"#fb9f3a" | |
], | |
[ | |
0.8888888888888888, | |
"#fdca26" | |
], | |
[ | |
1, | |
"#f0f921" | |
] | |
], | |
"type": "surface" | |
} | |
], | |
"table": [ | |
{ | |
"cells": { | |
"fill": { | |
"color": "#EBF0F8" | |
}, | |
"line": { | |
"color": "white" | |
} | |
}, | |
"header": { | |
"fill": { | |
"color": "#C8D4E3" | |
}, | |
"line": { | |
"color": "white" | |
} | |
}, | |
"type": "table" | |
} | |
] | |
}, | |
"layout": { | |
"annotationdefaults": { | |
"arrowcolor": "#2a3f5f", | |
"arrowhead": 0, | |
"arrowwidth": 1 | |
}, | |
"autotypenumbers": "strict", | |
"coloraxis": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"colorscale": { | |
"diverging": [ | |
[ | |
0, | |
"#8e0152" | |
], | |
[ | |
0.1, | |
"#c51b7d" | |
], | |
[ | |
0.2, | |
"#de77ae" | |
], | |
[ | |
0.3, | |
"#f1b6da" | |
], | |
[ | |
0.4, | |
"#fde0ef" | |
], | |
[ | |
0.5, | |
"#f7f7f7" | |
], | |
[ | |
0.6, | |
"#e6f5d0" | |
], | |
[ | |
0.7, | |
"#b8e186" | |
], | |
[ | |
0.8, | |
"#7fbc41" | |
], | |
[ | |
0.9, | |
"#4d9221" | |
], | |
[ | |
1, | |
"#276419" | |
] | |
], | |
"sequential": [ | |
[ | |
0, | |
"#0d0887" | |
], | |
[ | |
0.1111111111111111, | |
"#46039f" | |
], | |
[ | |
0.2222222222222222, | |
"#7201a8" | |
], | |
[ | |
0.3333333333333333, | |
"#9c179e" | |
], | |
[ | |
0.4444444444444444, | |
"#bd3786" | |
], | |
[ | |
0.5555555555555556, | |
"#d8576b" | |
], | |
[ | |
0.6666666666666666, | |
"#ed7953" | |
], | |
[ | |
0.7777777777777778, | |
"#fb9f3a" | |
], | |
[ | |
0.8888888888888888, | |
"#fdca26" | |
], | |
[ | |
1, | |
"#f0f921" | |
] | |
], | |
"sequentialminus": [ | |
[ | |
0, | |
"#0d0887" | |
], | |
[ | |
0.1111111111111111, | |
"#46039f" | |
], | |
[ | |
0.2222222222222222, | |
"#7201a8" | |
], | |
[ | |
0.3333333333333333, | |
"#9c179e" | |
], | |
[ | |
0.4444444444444444, | |
"#bd3786" | |
], | |
[ | |
0.5555555555555556, | |
"#d8576b" | |
], | |
[ | |
0.6666666666666666, | |
"#ed7953" | |
], | |
[ | |
0.7777777777777778, | |
"#fb9f3a" | |
], | |
[ | |
0.8888888888888888, | |
"#fdca26" | |
], | |
[ | |
1, | |
"#f0f921" | |
] | |
] | |
}, | |
"colorway": [ | |
"#636efa", | |
"#EF553B", | |
"#00cc96", | |
"#ab63fa", | |
"#FFA15A", | |
"#19d3f3", | |
"#FF6692", | |
"#B6E880", | |
"#FF97FF", | |
"#FECB52" | |
], | |
"font": { | |
"color": "#2a3f5f" | |
}, | |
"geo": { | |
"bgcolor": "white", | |
"lakecolor": "white", | |
"landcolor": "#E5ECF6", | |
"showlakes": true, | |
"showland": true, | |
"subunitcolor": "white" | |
}, | |
"hoverlabel": { | |
"align": "left" | |
}, | |
"hovermode": "closest", | |
"mapbox": { | |
"style": "light" | |
}, | |
"paper_bgcolor": "white", | |
"plot_bgcolor": "#E5ECF6", | |
"polar": { | |
"angularaxis": { | |
"gridcolor": "white", | |
"linecolor": "white", | |
"ticks": "" | |
}, | |
"bgcolor": "#E5ECF6", | |
"radialaxis": { | |
"gridcolor": "white", | |
"linecolor": "white", | |
"ticks": "" | |
} | |
}, | |
"scene": { | |
"xaxis": { | |
"backgroundcolor": "#E5ECF6", | |
"gridcolor": "white", | |
"gridwidth": 2, | |
"linecolor": "white", | |
"showbackground": true, | |
"ticks": "", | |
"zerolinecolor": "white" | |
}, | |
"yaxis": { | |
"backgroundcolor": "#E5ECF6", | |
"gridcolor": "white", | |
"gridwidth": 2, | |
"linecolor": "white", | |
"showbackground": true, | |
"ticks": "", | |
"zerolinecolor": "white" | |
}, | |
"zaxis": { | |
"backgroundcolor": "#E5ECF6", | |
"gridcolor": "white", | |
"gridwidth": 2, | |
"linecolor": "white", | |
"showbackground": true, | |
"ticks": "", | |
"zerolinecolor": "white" | |
} | |
}, | |
"shapedefaults": { | |
"line": { | |
"color": "#2a3f5f" | |
} | |
}, | |
"ternary": { | |
"aaxis": { | |
"gridcolor": "white", | |
"linecolor": "white", | |
"ticks": "" | |
}, | |
"baxis": { | |
"gridcolor": "white", | |
"linecolor": "white", | |
"ticks": "" | |
}, | |
"bgcolor": "#E5ECF6", | |
"caxis": { | |
"gridcolor": "white", | |
"linecolor": "white", | |
"ticks": "" | |
} | |
}, | |
"title": { | |
"x": 0.05 | |
}, | |
"xaxis": { | |
"automargin": true, | |
"gridcolor": "white", | |
"linecolor": "white", | |
"ticks": "", | |
"title": { | |
"standoff": 15 | |
}, | |
"zerolinecolor": "white", | |
"zerolinewidth": 2 | |
}, | |
"yaxis": { | |
"automargin": true, | |
"gridcolor": "white", | |
"linecolor": "white", | |
"ticks": "", | |
"title": { | |
"standoff": 15 | |
}, | |
"zerolinecolor": "white", | |
"zerolinewidth": 2 | |
} | |
} | |
}, | |
"title": { | |
"text": "6081" | |
}, | |
"xaxis": { | |
"anchor": "y", | |
"domain": [ | |
0, | |
1 | |
], | |
"title": { | |
"text": "month" | |
} | |
}, | |
"yaxis": { | |
"anchor": "x", | |
"domain": [ | |
0, | |
1 | |
], | |
"title": { | |
"text": "target" | |
} | |
} | |
} | |
} | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"cfips = ds.query('is_train').cfips.sample().unique()[0]\n", | |
"#cfips = 32029\n", | |
"ds.query('cfips == @cfips').plot(x='month', y='target', title=str(cfips))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>pct_bb_2017</th>\n", | |
" <th>pct_bb_2018</th>\n", | |
" <th>pct_bb_2019</th>\n", | |
" <th>pct_bb_2020</th>\n", | |
" <th>pct_bb_2021</th>\n", | |
" <th>pct_college_2017</th>\n", | |
" <th>pct_college_2018</th>\n", | |
" <th>pct_college_2019</th>\n", | |
" <th>pct_college_2020</th>\n", | |
" <th>pct_college_2021</th>\n", | |
" <th>pct_foreign_born_2017</th>\n", | |
" <th>pct_foreign_born_2018</th>\n", | |
" <th>pct_foreign_born_2019</th>\n", | |
" <th>pct_foreign_born_2020</th>\n", | |
" <th>pct_foreign_born_2021</th>\n", | |
" <th>pct_it_workers_2017</th>\n", | |
" <th>pct_it_workers_2018</th>\n", | |
" <th>pct_it_workers_2019</th>\n", | |
" <th>pct_it_workers_2020</th>\n", | |
" <th>pct_it_workers_2021</th>\n", | |
" <th>median_hh_inc_2017</th>\n", | |
" <th>median_hh_inc_2018</th>\n", | |
" <th>median_hh_inc_2019</th>\n", | |
" <th>median_hh_inc_2020</th>\n", | |
" <th>median_hh_inc_2021</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>cfips</th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>1001</th>\n", | |
" <td>76.6</td>\n", | |
" <td>78.9</td>\n", | |
" <td>80.6</td>\n", | |
" <td>82.7</td>\n", | |
" <td>85.5</td>\n", | |
" <td>14.5</td>\n", | |
" <td>15.9</td>\n", | |
" <td>16.1</td>\n", | |
" <td>16.7</td>\n", | |
" <td>16.4</td>\n", | |
" <td>2.1</td>\n", | |
" <td>2.0</td>\n", | |
" <td>2.3</td>\n", | |
" <td>2.3</td>\n", | |
" <td>2.1</td>\n", | |
" <td>1.3</td>\n", | |
" <td>1.1</td>\n", | |
" <td>0.7</td>\n", | |
" <td>0.6</td>\n", | |
" <td>1.1</td>\n", | |
" <td>55317</td>\n", | |
" <td>58786.0</td>\n", | |
" <td>58731</td>\n", | |
" <td>57982.0</td>\n", | |
" <td>62660.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1003</th>\n", | |
" <td>74.5</td>\n", | |
" <td>78.1</td>\n", | |
" <td>81.8</td>\n", | |
" <td>85.1</td>\n", | |
" <td>87.9</td>\n", | |
" <td>20.4</td>\n", | |
" <td>20.7</td>\n", | |
" <td>21.0</td>\n", | |
" <td>20.2</td>\n", | |
" <td>20.6</td>\n", | |
" <td>3.2</td>\n", | |
" <td>3.4</td>\n", | |
" <td>3.7</td>\n", | |
" <td>3.4</td>\n", | |
" <td>3.5</td>\n", | |
" <td>1.4</td>\n", | |
" <td>1.3</td>\n", | |
" <td>1.4</td>\n", | |
" <td>1.0</td>\n", | |
" <td>1.3</td>\n", | |
" <td>52562</td>\n", | |
" <td>55962.0</td>\n", | |
" <td>58320</td>\n", | |
" <td>61756.0</td>\n", | |
" <td>64346.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1005</th>\n", | |
" <td>57.2</td>\n", | |
" <td>60.4</td>\n", | |
" <td>60.5</td>\n", | |
" <td>64.6</td>\n", | |
" <td>64.6</td>\n", | |
" <td>7.6</td>\n", | |
" <td>7.8</td>\n", | |
" <td>7.6</td>\n", | |
" <td>7.3</td>\n", | |
" <td>6.7</td>\n", | |
" <td>2.7</td>\n", | |
" <td>2.5</td>\n", | |
" <td>2.7</td>\n", | |
" <td>2.6</td>\n", | |
" <td>2.6</td>\n", | |
" <td>0.5</td>\n", | |
" <td>0.3</td>\n", | |
" <td>0.8</td>\n", | |
" <td>1.1</td>\n", | |
" <td>0.8</td>\n", | |
" <td>33368</td>\n", | |
" <td>34186.0</td>\n", | |
" <td>32525</td>\n", | |
" <td>34990.0</td>\n", | |
" <td>36422.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1007</th>\n", | |
" <td>62.0</td>\n", | |
" <td>66.1</td>\n", | |
" <td>69.2</td>\n", | |
" <td>76.1</td>\n", | |
" <td>74.6</td>\n", | |
" <td>8.1</td>\n", | |
" <td>7.6</td>\n", | |
" <td>6.5</td>\n", | |
" <td>7.4</td>\n", | |
" <td>7.9</td>\n", | |
" <td>1.0</td>\n", | |
" <td>1.4</td>\n", | |
" <td>1.5</td>\n", | |
" <td>1.6</td>\n", | |
" <td>1.1</td>\n", | |
" <td>1.2</td>\n", | |
" <td>1.4</td>\n", | |
" <td>1.6</td>\n", | |
" <td>1.7</td>\n", | |
" <td>2.1</td>\n", | |
" <td>43404</td>\n", | |
" <td>45340.0</td>\n", | |
" <td>47542</td>\n", | |
" <td>51721.0</td>\n", | |
" <td>54277.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1009</th>\n", | |
" <td>65.8</td>\n", | |
" <td>68.5</td>\n", | |
" <td>73.0</td>\n", | |
" <td>79.6</td>\n", | |
" <td>81.0</td>\n", | |
" <td>8.7</td>\n", | |
" <td>8.1</td>\n", | |
" <td>8.6</td>\n", | |
" <td>8.9</td>\n", | |
" <td>9.3</td>\n", | |
" <td>4.5</td>\n", | |
" <td>4.4</td>\n", | |
" <td>4.5</td>\n", | |
" <td>4.4</td>\n", | |
" <td>4.5</td>\n", | |
" <td>1.3</td>\n", | |
" <td>1.4</td>\n", | |
" <td>0.9</td>\n", | |
" <td>1.1</td>\n", | |
" <td>0.9</td>\n", | |
" <td>47412</td>\n", | |
" <td>48695.0</td>\n", | |
" <td>49358</td>\n", | |
" <td>48922.0</td>\n", | |
" <td>52830.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>...</th>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>56037</th>\n", | |
" <td>82.2</td>\n", | |
" <td>82.4</td>\n", | |
" <td>84.0</td>\n", | |
" <td>86.7</td>\n", | |
" <td>88.4</td>\n", | |
" <td>15.3</td>\n", | |
" <td>15.2</td>\n", | |
" <td>14.8</td>\n", | |
" <td>13.7</td>\n", | |
" <td>12.4</td>\n", | |
" <td>5.0</td>\n", | |
" <td>5.3</td>\n", | |
" <td>4.7</td>\n", | |
" <td>5.2</td>\n", | |
" <td>5.5</td>\n", | |
" <td>0.6</td>\n", | |
" <td>0.6</td>\n", | |
" <td>1.0</td>\n", | |
" <td>0.9</td>\n", | |
" <td>1.0</td>\n", | |
" <td>71083</td>\n", | |
" <td>73008.0</td>\n", | |
" <td>74843</td>\n", | |
" <td>73384.0</td>\n", | |
" <td>76668.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>56039</th>\n", | |
" <td>83.5</td>\n", | |
" <td>85.9</td>\n", | |
" <td>87.1</td>\n", | |
" <td>89.1</td>\n", | |
" <td>90.5</td>\n", | |
" <td>37.7</td>\n", | |
" <td>37.8</td>\n", | |
" <td>38.9</td>\n", | |
" <td>37.2</td>\n", | |
" <td>38.3</td>\n", | |
" <td>10.8</td>\n", | |
" <td>11.2</td>\n", | |
" <td>11.8</td>\n", | |
" <td>11.4</td>\n", | |
" <td>11.1</td>\n", | |
" <td>0.7</td>\n", | |
" <td>1.2</td>\n", | |
" <td>1.4</td>\n", | |
" <td>1.5</td>\n", | |
" <td>2.0</td>\n", | |
" <td>80049</td>\n", | |
" <td>83831.0</td>\n", | |
" <td>84678</td>\n", | |
" <td>87053.0</td>\n", | |
" <td>94498.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>56041</th>\n", | |
" <td>83.8</td>\n", | |
" <td>88.2</td>\n", | |
" <td>89.5</td>\n", | |
" <td>91.4</td>\n", | |
" <td>90.6</td>\n", | |
" <td>11.9</td>\n", | |
" <td>10.5</td>\n", | |
" <td>11.1</td>\n", | |
" <td>12.6</td>\n", | |
" <td>12.3</td>\n", | |
" <td>2.9</td>\n", | |
" <td>3.1</td>\n", | |
" <td>2.9</td>\n", | |
" <td>2.9</td>\n", | |
" <td>2.9</td>\n", | |
" <td>1.2</td>\n", | |
" <td>1.2</td>\n", | |
" <td>1.4</td>\n", | |
" <td>1.7</td>\n", | |
" <td>0.9</td>\n", | |
" <td>54672</td>\n", | |
" <td>58235.0</td>\n", | |
" <td>63403</td>\n", | |
" <td>72458.0</td>\n", | |
" <td>75106.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>56043</th>\n", | |
" <td>76.4</td>\n", | |
" <td>78.3</td>\n", | |
" <td>78.2</td>\n", | |
" <td>82.8</td>\n", | |
" <td>85.4</td>\n", | |
" <td>15.4</td>\n", | |
" <td>15.0</td>\n", | |
" <td>15.4</td>\n", | |
" <td>15.0</td>\n", | |
" <td>17.2</td>\n", | |
" <td>2.3</td>\n", | |
" <td>1.4</td>\n", | |
" <td>1.6</td>\n", | |
" <td>2.2</td>\n", | |
" <td>1.0</td>\n", | |
" <td>1.3</td>\n", | |
" <td>1.0</td>\n", | |
" <td>0.9</td>\n", | |
" <td>0.9</td>\n", | |
" <td>1.1</td>\n", | |
" <td>51362</td>\n", | |
" <td>53426.0</td>\n", | |
" <td>54158</td>\n", | |
" <td>57306.0</td>\n", | |
" <td>62271.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>56045</th>\n", | |
" <td>71.1</td>\n", | |
" <td>73.3</td>\n", | |
" <td>76.8</td>\n", | |
" <td>79.7</td>\n", | |
" <td>81.3</td>\n", | |
" <td>14.1</td>\n", | |
" <td>13.5</td>\n", | |
" <td>13.4</td>\n", | |
" <td>12.7</td>\n", | |
" <td>13.9</td>\n", | |
" <td>3.8</td>\n", | |
" <td>4.1</td>\n", | |
" <td>1.7</td>\n", | |
" <td>2.3</td>\n", | |
" <td>1.6</td>\n", | |
" <td>0.6</td>\n", | |
" <td>0.6</td>\n", | |
" <td>0.0</td>\n", | |
" <td>0.0</td>\n", | |
" <td>0.0</td>\n", | |
" <td>59605</td>\n", | |
" <td>52867.0</td>\n", | |
" <td>57031</td>\n", | |
" <td>53333.0</td>\n", | |
" <td>65566.0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>3142 rows × 25 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" pct_bb_2017 pct_bb_2018 pct_bb_2019 pct_bb_2020 pct_bb_2021 \\\n", | |
"cfips \n", | |
"1001 76.6 78.9 80.6 82.7 85.5 \n", | |
"1003 74.5 78.1 81.8 85.1 87.9 \n", | |
"1005 57.2 60.4 60.5 64.6 64.6 \n", | |
"1007 62.0 66.1 69.2 76.1 74.6 \n", | |
"1009 65.8 68.5 73.0 79.6 81.0 \n", | |
"... ... ... ... ... ... \n", | |
"56037 82.2 82.4 84.0 86.7 88.4 \n", | |
"56039 83.5 85.9 87.1 89.1 90.5 \n", | |
"56041 83.8 88.2 89.5 91.4 90.6 \n", | |
"56043 76.4 78.3 78.2 82.8 85.4 \n", | |
"56045 71.1 73.3 76.8 79.7 81.3 \n", | |
"\n", | |
" pct_college_2017 pct_college_2018 pct_college_2019 pct_college_2020 \\\n", | |
"cfips \n", | |
"1001 14.5 15.9 16.1 16.7 \n", | |
"1003 20.4 20.7 21.0 20.2 \n", | |
"1005 7.6 7.8 7.6 7.3 \n", | |
"1007 8.1 7.6 6.5 7.4 \n", | |
"1009 8.7 8.1 8.6 8.9 \n", | |
"... ... ... ... ... \n", | |
"56037 15.3 15.2 14.8 13.7 \n", | |
"56039 37.7 37.8 38.9 37.2 \n", | |
"56041 11.9 10.5 11.1 12.6 \n", | |
"56043 15.4 15.0 15.4 15.0 \n", | |
"56045 14.1 13.5 13.4 12.7 \n", | |
"\n", | |
" pct_college_2021 pct_foreign_born_2017 pct_foreign_born_2018 \\\n", | |
"cfips \n", | |
"1001 16.4 2.1 2.0 \n", | |
"1003 20.6 3.2 3.4 \n", | |
"1005 6.7 2.7 2.5 \n", | |
"1007 7.9 1.0 1.4 \n", | |
"1009 9.3 4.5 4.4 \n", | |
"... ... ... ... \n", | |
"56037 12.4 5.0 5.3 \n", | |
"56039 38.3 10.8 11.2 \n", | |
"56041 12.3 2.9 3.1 \n", | |
"56043 17.2 2.3 1.4 \n", | |
"56045 13.9 3.8 4.1 \n", | |
"\n", | |
" pct_foreign_born_2019 pct_foreign_born_2020 pct_foreign_born_2021 \\\n", | |
"cfips \n", | |
"1001 2.3 2.3 2.1 \n", | |
"1003 3.7 3.4 3.5 \n", | |
"1005 2.7 2.6 2.6 \n", | |
"1007 1.5 1.6 1.1 \n", | |
"1009 4.5 4.4 4.5 \n", | |
"... ... ... ... \n", | |
"56037 4.7 5.2 5.5 \n", | |
"56039 11.8 11.4 11.1 \n", | |
"56041 2.9 2.9 2.9 \n", | |
"56043 1.6 2.2 1.0 \n", | |
"56045 1.7 2.3 1.6 \n", | |
"\n", | |
" pct_it_workers_2017 pct_it_workers_2018 pct_it_workers_2019 \\\n", | |
"cfips \n", | |
"1001 1.3 1.1 0.7 \n", | |
"1003 1.4 1.3 1.4 \n", | |
"1005 0.5 0.3 0.8 \n", | |
"1007 1.2 1.4 1.6 \n", | |
"1009 1.3 1.4 0.9 \n", | |
"... ... ... ... \n", | |
"56037 0.6 0.6 1.0 \n", | |
"56039 0.7 1.2 1.4 \n", | |
"56041 1.2 1.2 1.4 \n", | |
"56043 1.3 1.0 0.9 \n", | |
"56045 0.6 0.6 0.0 \n", | |
"\n", | |
" pct_it_workers_2020 pct_it_workers_2021 median_hh_inc_2017 \\\n", | |
"cfips \n", | |
"1001 0.6 1.1 55317 \n", | |
"1003 1.0 1.3 52562 \n", | |
"1005 1.1 0.8 33368 \n", | |
"1007 1.7 2.1 43404 \n", | |
"1009 1.1 0.9 47412 \n", | |
"... ... ... ... \n", | |
"56037 0.9 1.0 71083 \n", | |
"56039 1.5 2.0 80049 \n", | |
"56041 1.7 0.9 54672 \n", | |
"56043 0.9 1.1 51362 \n", | |
"56045 0.0 0.0 59605 \n", | |
"\n", | |
" median_hh_inc_2018 median_hh_inc_2019 median_hh_inc_2020 \\\n", | |
"cfips \n", | |
"1001 58786.0 58731 57982.0 \n", | |
"1003 55962.0 58320 61756.0 \n", | |
"1005 34186.0 32525 34990.0 \n", | |
"1007 45340.0 47542 51721.0 \n", | |
"1009 48695.0 49358 48922.0 \n", | |
"... ... ... ... \n", | |
"56037 73008.0 74843 73384.0 \n", | |
"56039 83831.0 84678 87053.0 \n", | |
"56041 58235.0 63403 72458.0 \n", | |
"56043 53426.0 54158 57306.0 \n", | |
"56045 52867.0 57031 53333.0 \n", | |
"\n", | |
" median_hh_inc_2021 \n", | |
"cfips \n", | |
"1001 62660.0 \n", | |
"1003 64346.0 \n", | |
"1005 36422.0 \n", | |
"1007 54277.0 \n", | |
"1009 52830.0 \n", | |
"... ... \n", | |
"56037 76668.0 \n", | |
"56039 94498.0 \n", | |
"56041 75106.0 \n", | |
"56043 62271.0 \n", | |
"56045 65566.0 \n", | |
"\n", | |
"[3142 rows x 25 columns]" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"pd.read_csv('data/census_starter.csv', index_col='cfips')" | |
] | |
}, | |
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Target engineering" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 52, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"/var/folders/8q/xqx16rw14bl7rqyybl7zk8sh0000gn/T/ipykernel_99328/3922010533.py:7: RuntimeWarning:\n", | |
"\n", | |
"invalid value encountered in scalar subtract\n", | |
"\n" | |
] | |
} | |
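],
"source": [
"for cfips in ds['cfips'].unique():\n",
"    indices = ds['cfips'].eq(cfips)\n",
"    val = ds.loc[indices, 'target'].values.copy()\n",
"    # Walk backwards in time; a month-over-month jump of at least 20% of the mean of the earlier values is treated as a level shift\n",
"    for i in range(37, 2, -1):\n",
"        threshold = 0.2 * np.mean(val[:i])\n",
"        difa = abs(val[i] - val[i - 1])\n",
"        if difa >= threshold:\n",
"            # Rescale everything before the jump so that val[i - 1] lines up with val[i]\n",
"            val[:i] *= val[i] / val[i - 1]\n",
"\n",
"    # Overwrite the first observation with a value just below the second one\n",
"    val[0] = val[1] * 0.99\n",
"    ds.loc[indices, 'target'] = val"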
] | |
}, | |
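{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The rescaling above is easier to follow on a toy series. The cell below is a minimal sketch, not part of the original pipeline: the values in `toy` are made up, and the loop runs over the whole series instead of the fixed `range(37, 2, -1)`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Made-up series with a level shift between positions 3 and 4\n",
"toy = np.array([1.0, 1.0, 1.1, 1.0, 2.0, 2.1, 2.0])\n",
"\n",
"for i in range(len(toy) - 1, 0, -1):\n",
"    # Same rule as above: compare the jump to 20% of the mean of the earlier values\n",
"    threshold = 0.2 * np.mean(toy[:i])\n",
"    if abs(toy[i] - toy[i - 1]) >= threshold:\n",
"        # Rescale the earlier values so the series is continuous at position i\n",
"        toy[:i] *= toy[i] / toy[i - 1]\n",
"\n",
"toy.round(2)"
]
},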
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Feature engineering" | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"for shift in range(1, 39):\n",
"    ds[f'target-{shift}'] = ds.groupby('cfips')['target'].shift(shift)"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"ds['first_target_value'] = ds.groupby('cfips')['target'].transform('first')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 51, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>cfips</th>\n", | |
" <th>county</th>\n", | |
" <th>state</th>\n", | |
" <th>target-1</th>\n", | |
" <th>target-2</th>\n", | |
" <th>target-3</th>\n", | |
" <th>target-4</th>\n", | |
" <th>target-5</th>\n", | |
" <th>target-6</th>\n", | |
" <th>target-7</th>\n", | |
" <th>target-8</th>\n", | |
" <th>target-9</th>\n", | |
" <th>target-10</th>\n", | |
" <th>target-11</th>\n", | |
" <th>target-12</th>\n", | |
" <th>target-13</th>\n", | |
" <th>target-14</th>\n", | |
" <th>target-15</th>\n", | |
" <th>target-16</th>\n", | |
" <th>target-17</th>\n", | |
" <th>target-18</th>\n", | |
" <th>target-19</th>\n", | |
" <th>target-20</th>\n", | |
" <th>target-21</th>\n", | |
" <th>target-22</th>\n", | |
" <th>target-23</th>\n", | |
" <th>first_target_value</th>\n", | |
" <th>target_rolling_3_mean</th>\n", | |
" <th>target_rolling_6_mean</th>\n", | |
" <th>target_rolling_9_mean</th>\n", | |
" <th>target_rolling_12_mean</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>row_id</th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>1001_2019-08-01</th>\n", | |
" <td>1001</td>\n", | |
" <td>Autauga County</td>\n", | |
" <td>Alabama</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>2.856021</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1001_2019-09-01</th>\n", | |
" <td>1001</td>\n", | |
" <td>Autauga County</td>\n", | |
" <td>Alabama</td>\n", | |
" <td>2.856021</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>2.856021</td>\n", | |
" <td>2.856021</td>\n", | |
" <td>2.856021</td>\n", | |
" <td>2.856021</td>\n", | |
" <td>2.856021</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1001_2019-10-01</th>\n", | |
" <td>1001</td>\n", | |
" <td>Autauga County</td>\n", | |
" <td>Alabama</td>\n", | |
" <td>2.884870</td>\n", | |
" <td>2.856021</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>2.856021</td>\n", | |
" <td>2.870446</td>\n", | |
" <td>2.870446</td>\n", | |
" <td>2.870446</td>\n", | |
" <td>2.870446</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1001_2019-11-01</th>\n", | |
" <td>1001</td>\n", | |
" <td>Autauga County</td>\n", | |
" <td>Alabama</td>\n", | |
" <td>3.055843</td>\n", | |
" <td>2.884870</td>\n", | |
" <td>2.856021</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>2.856021</td>\n", | |
" <td>2.970357</td>\n", | |
" <td>2.932245</td>\n", | |
" <td>2.932245</td>\n", | |
" <td>2.932245</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1001_2019-12-01</th>\n", | |
" <td>1001</td>\n", | |
" <td>Autauga County</td>\n", | |
" <td>Alabama</td>\n", | |
" <td>2.993233</td>\n", | |
" <td>3.055843</td>\n", | |
" <td>2.884870</td>\n", | |
" <td>2.856021</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>2.856021</td>\n", | |
" <td>3.024538</td>\n", | |
" <td>2.947492</td>\n", | |
" <td>2.947492</td>\n", | |
" <td>2.947492</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" cfips county state target-1 target-2 target-3 \\\n", | |
"row_id \n", | |
"1001_2019-08-01 1001 Autauga County Alabama NaN NaN NaN \n", | |
"1001_2019-09-01 1001 Autauga County Alabama 2.856021 NaN NaN \n", | |
"1001_2019-10-01 1001 Autauga County Alabama 2.884870 2.856021 NaN \n", | |
"1001_2019-11-01 1001 Autauga County Alabama 3.055843 2.884870 2.856021 \n", | |
"1001_2019-12-01 1001 Autauga County Alabama 2.993233 3.055843 2.884870 \n", | |
"\n", | |
" target-4 target-5 target-6 target-7 target-8 target-9 \\\n", | |
"row_id \n", | |
"1001_2019-08-01 NaN NaN NaN NaN NaN NaN \n", | |
"1001_2019-09-01 NaN NaN NaN NaN NaN NaN \n", | |
"1001_2019-10-01 NaN NaN NaN NaN NaN NaN \n", | |
"1001_2019-11-01 NaN NaN NaN NaN NaN NaN \n", | |
"1001_2019-12-01 2.856021 NaN NaN NaN NaN NaN \n", | |
"\n", | |
" target-10 target-11 target-12 target-13 target-14 \\\n", | |
"row_id \n", | |
"1001_2019-08-01 NaN NaN NaN NaN NaN \n", | |
"1001_2019-09-01 NaN NaN NaN NaN NaN \n", | |
"1001_2019-10-01 NaN NaN NaN NaN NaN \n", | |
"1001_2019-11-01 NaN NaN NaN NaN NaN \n", | |
"1001_2019-12-01 NaN NaN NaN NaN NaN \n", | |
"\n", | |
" target-15 target-16 target-17 target-18 target-19 \\\n", | |
"row_id \n", | |
"1001_2019-08-01 NaN NaN NaN NaN NaN \n", | |
"1001_2019-09-01 NaN NaN NaN NaN NaN \n", | |
"1001_2019-10-01 NaN NaN NaN NaN NaN \n", | |
"1001_2019-11-01 NaN NaN NaN NaN NaN \n", | |
"1001_2019-12-01 NaN NaN NaN NaN NaN \n", | |
"\n", | |
" target-20 target-21 target-22 target-23 \\\n", | |
"row_id \n", | |
"1001_2019-08-01 NaN NaN NaN NaN \n", | |
"1001_2019-09-01 NaN NaN NaN NaN \n", | |
"1001_2019-10-01 NaN NaN NaN NaN \n", | |
"1001_2019-11-01 NaN NaN NaN NaN \n", | |
"1001_2019-12-01 NaN NaN NaN NaN \n", | |
"\n", | |
" first_target_value target_rolling_3_mean \\\n", | |
"row_id \n", | |
"1001_2019-08-01 2.856021 NaN \n", | |
"1001_2019-09-01 2.856021 2.856021 \n", | |
"1001_2019-10-01 2.856021 2.870446 \n", | |
"1001_2019-11-01 2.856021 2.970357 \n", | |
"1001_2019-12-01 2.856021 3.024538 \n", | |
"\n", | |
" target_rolling_6_mean target_rolling_9_mean \\\n", | |
"row_id \n", | |
"1001_2019-08-01 NaN NaN \n", | |
"1001_2019-09-01 2.856021 2.856021 \n", | |
"1001_2019-10-01 2.870446 2.870446 \n", | |
"1001_2019-11-01 2.932245 2.932245 \n", | |
"1001_2019-12-01 2.947492 2.947492 \n", | |
"\n", | |
" target_rolling_12_mean \n", | |
"row_id \n", | |
"1001_2019-08-01 NaN \n", | |
"1001_2019-09-01 2.856021 \n", | |
"1001_2019-10-01 2.870446 \n", | |
"1001_2019-11-01 2.932245 \n", | |
"1001_2019-12-01 2.947492 " | |
] | |
}, | |
"execution_count": 51, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
],
"source": [
"from scipy import stats\n",
"from sklearn import preprocessing\n",
"\n",
"not_features = ['target', 'month', 'active', 'is_train', 'lat', 'lng']\n",
"\n",
"def extract_features(X):\n",
"    Z = X.copy()\n",
"    for w in [3, 6, 9, 12]:\n",
"        # Note: range(1, w) selects lags 1 to w - 1, so each mean covers w - 1 lags;\n",
"        # missing lags are skipped by mean()\n",
"        Z[f'target_rolling_{w}_mean'] = (\n",
"            Z[[f'target-{k}' for k in range(1, w)]]\n",
"            .mean(axis='columns')\n",
"        )\n",
"    Z = Z.drop(columns=(\n",
"        not_features +\n",
"        [f'target-{k}' for k in range(24, 39)]\n",
"    ))\n",
"    return Z\n",
"\n", | |
"extract_features(ds).head()" | |
] | |
}, | |
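{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the lag and rolling-mean features concrete, here is a small sketch on made-up data. `toy_panel` and its values are hypothetical; only the `groupby(...).shift(...)` and row-wise `mean` pattern is taken from the cells above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical panel: two counties, four months each\n",
"toy_panel = pd.DataFrame({\n",
"    'cfips': [1, 1, 1, 1, 2, 2, 2, 2],\n",
"    'target': [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],\n",
"})\n",
"\n",
"# Lags are computed within each county, so values never leak across series\n",
"for shift in range(1, 4):\n",
"    toy_panel[f'target-{shift}'] = toy_panel.groupby('cfips')['target'].shift(shift)\n",
"\n",
"# Row-wise mean over the lag columns; NaN lags are skipped by default\n",
"lag_columns = [f'target-{k}' for k in range(1, 4)]\n",
"toy_panel['lag_mean'] = toy_panel[lag_columns].mean(axis='columns')\n",
"toy_panel"
]
},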
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Baseline" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"1.4888\n" | |
] | |
} | |
], | |
"source": [ | |
"D = ds.copy()\n", | |
"oof = D[D.is_train & (D.month > '2019-08-01')]\n", | |
"print(smape(oof.target, oof['target-1']).round(4))\n", | |
"D[~D.is_train]['target-1'].rename('microbusiness_density').to_csv('baseline_submission.csv.zip', header=True)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 59, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"cfips\n", | |
"56037 4.0933\n", | |
"56039 7.5138\n", | |
"56041 3.7991\n", | |
"56043 3.7272\n", | |
"56045 3.1007\n", | |
"dtype: float64" | |
] | |
}, | |
"execution_count": 59, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
],
"source": [
"(\n",
"    D[D.is_train & (D.month > '2019-08-01')]\n",
"    .groupby('cfips')\n",
"    .apply(lambda df: smape(df.target, df['target-1']).round(4))\n",
"    .tail(5)\n",
")" | |
] | |
}, | |
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Learning" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[620]\tfit's huber: 1.83369\tval's huber: 1.84429\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[186]\tfit's huber: 1.8359\tval's huber: 1.84012\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[215]\tfit's huber: 1.84221\tval's huber: 1.8274\n", | |
"t+1 - SMAPE: 1.6249\n", | |
"\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[280]\tfit's huber: 1.83373\tval's huber: 1.84432\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[483]\tfit's huber: 1.83588\tval's huber: 1.84042\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[210]\tfit's huber: 1.84222\tval's huber: 1.82718\n", | |
"t+2 - SMAPE: 1.6964\n", | |
"\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[266]\tfit's huber: 1.83374\tval's huber: 1.84421\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[344]\tfit's huber: 1.8359\tval's huber: 1.8404\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[189]\tfit's huber: 1.84223\tval's huber: 1.82705\n", | |
"t+3 - SMAPE: 1.7736\n", | |
"\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[486]\tfit's huber: 1.83373\tval's huber: 1.84417\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[181]\tfit's huber: 1.83593\tval's huber: 1.84055\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[225]\tfit's huber: 1.84224\tval's huber: 1.827\n", | |
"t+4 - SMAPE: 1.8577\n", | |
"\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[309]\tfit's huber: 1.83376\tval's huber: 1.84409\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[211]\tfit's huber: 1.83594\tval's huber: 1.84063\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[220]\tfit's huber: 1.84225\tval's huber: 1.82686\n", | |
"t+5 - SMAPE: 1.9731\n", | |
"\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[386]\tfit's huber: 1.83376\tval's huber: 1.84419\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[201]\tfit's huber: 1.83596\tval's huber: 1.8409\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[209]\tfit's huber: 1.84227\tval's huber: 1.82693\n", | |
"t+6 - SMAPE: 2.1009\n", | |
"\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[819]\tfit's huber: 1.83374\tval's huber: 1.84413\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[248]\tfit's huber: 1.83597\tval's huber: 1.84089\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[385]\tfit's huber: 1.84227\tval's huber: 1.82687\n", | |
"t+7 - SMAPE: 2.2476\n", | |
"\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[494]\tfit's huber: 1.83378\tval's huber: 1.84426\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[220]\tfit's huber: 1.83601\tval's huber: 1.84106\n", | |
"Training until validation scores don't improve for 50 rounds\n", | |
"Early stopping, best iteration is:\n", | |
"[176]\tfit's huber: 1.84232\tval's huber: 1.82694\n", | |
"t+8 - SMAPE: 2.4071\n", | |
"\n" | |
] | |
} | |
],
"source": [
"import warnings\n",
"import chime\n",
"import lightgbm as lgb\n",
"from sklearn import compose\n",
"from sklearn import ensemble\n",
"from sklearn import model_selection\n",
"\n",
"chime.theme('zelda')\n",
"\n",
"cut_off = pd.Timestamp(year=2022, month=10, day=1)\n",
"one_month = pd.tseries.offsets.DateOffset(months=1)\n",
"horizon = 8\n",
"\n",
"D = ds.copy()\n",
"is_train = (\n",
"    D.is_train &\n",
"    (D.month > pd.Timestamp(year=2019, month=8, day=1))  # skip first month\n",
"    & D.target.notnull()\n",
"    & ~D.target.eq(np.inf)\n",
")\n",
"all_oof = pd.DataFrame(index=D[is_train].index)\n",
"\n",
"model = lgb.LGBMRegressor(\n",
"    n_estimators=1000,\n",
"    verbosity=-1,\n",
"    objective='huber',\n",
"    random_state=42,\n",
"    max_depth=12,\n",
"    learning_rate=0.08,\n",
"    min_child_samples=20\n",
")\n",
"model = compose.TransformedTargetRegressor(\n",
"    regressor=model,\n",
"    func=np.log1p,\n",
"    inverse_func=np.expm1\n",
")\n",
"\n",
"cv = model_selection.KFold(n_splits=3, shuffle=True, random_state=42)\n",
"\n",
"perf = []\n",
"\n",
"for h in range(1, horizon + 1):\n",
"\n",
"    # We want to predict h step(s) ahead\n",
"    is_test = D.month == (cut_off + h * one_month)\n",
"    is_test_next = D.month == (cut_off + (h + 1) * one_month)\n",
"\n",
"    # We will store out-of-fold and test predictions within each round\n",
"    oof = pd.Series(0.0, index=D[is_train].index)\n",
"    predictions = pd.Series(0.0, index=D[is_test].index)\n",
"\n",
"    # We do some feature engineering here to account for the fact that the\n",
"    # target is edited in the training set at each step\n",
"    features = extract_features(D)\n",
"    X_train = features[is_train]\n",
"    y_train = D[is_train].target\n",
"    X_test = features[is_test]\n",
"\n",
"    # Cross-validated fit/predict\n",
"    for fit_idx, val_idx in cv.split(X_train, y_train):\n",
"        X_fit = X_train.iloc[fit_idx]\n",
"        X_val = X_train.iloc[val_idx]\n",
"        y_fit = y_train.iloc[fit_idx]\n",
"        y_val = y_train.iloc[val_idx]\n",
"\n",
"        # TransformedTargetRegressor only transforms the y passed to fit(); the\n",
"        # eval_set targets are forwarded untouched, so put them on the log1p scale here\n",
"        eval_set = [(X_fit, np.log1p(y_fit)), (X_val, np.log1p(y_val))]\n",
"\n",
"        with warnings.catch_warnings():\n",
"            warnings.simplefilter('ignore', category=UserWarning)\n",
"            model.fit(\n",
"                X_fit, y_fit,\n",
"                eval_set=eval_set,\n",
"                eval_names=('fit', 'val'),\n",
"                categorical_feature=[\"state\"],\n",
"                callbacks=[\n",
"                    lgb.early_stopping(50),\n",
"                    #lgb.print_evaluation(100)\n",
"                ]\n",
"            )\n",
"        oof.iloc[val_idx] = model.predict(X_val)\n",
"        predictions += model.predict(X_test) / cv.n_splits\n",
"\n",
"    all_oof.loc[oof.index, h] = oof.values\n",
"    D.loc[predictions.index, 'target'] = predictions.values\n",
"    msg = f't+{h} - SMAPE: {smape(y_train, oof):.4f}'\n",
"    perf.append(msg)\n",
"    print(msg, end='\\n\\n')\n",
"\n",
"    # Update the training and test sets for the next step ahead\n",
"    for k in range(h, 1, -1):\n",
"        D.loc[is_train, f'target-{k}'] = D[is_train].groupby('cfips')[f'target-{k - 1}'].shift(1)\n",
"        if h < horizon:\n",
"            D.loc[is_test_next, f'target-{k}'] = D[is_test][f'target-{k - 1}'].values\n",
"    D.loc[is_train, 'target-1'] = oof.values\n",
"    if h < horizon:\n",
"        D.loc[is_test_next, 'target-1'] = predictions.values\n",
"\n", | |
"# Store the test predictions in the original dataset\n", | |
"ds.loc[~D.is_train, 'target'] = D.loc[~D.is_train, 'target'].values\n", | |
"\n", | |
"#chime.success()" | |
] | |
}, | |
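{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above feeds earlier predictions back in as lag features. The cell below is a stripped-down sketch of that general idea, not the notebook's exact bookkeeping: it assumes a single lag feature, a plain `LinearRegression` instead of LightGBM, and a made-up panel of random walks (`toy_walks`). `recursive_forecast_sketch` is a hypothetical helper introduced only for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import KFold, cross_val_predict\n",
"\n",
"def recursive_forecast_sketch(Y, horizon, n_splits=3):\n",
"    # Y has one row per series and one column per time step\n",
"    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=0)\n",
"    lag, target, last = Y[:, :-1], Y[:, 1:], Y[:, -1]\n",
"    forecasts = []\n",
"    for _ in range(horizon):\n",
"        X, y = lag.reshape(-1, 1), target.ravel()\n",
"        # Out-of-fold prediction of every training label\n",
"        oof = cross_val_predict(LinearRegression(), X, y, cv=kfold).reshape(lag.shape)\n",
"        # Forecast the next step from the latest value (or the latest forecast)\n",
"        last = LinearRegression().fit(X, y).predict(last.reshape(-1, 1))\n",
"        forecasts.append(last)\n",
"        # For the next step, the lag feature is the out-of-fold prediction of the\n",
"        # previous value, which costs one time step of training data per round\n",
"        lag, target = oof[:, :-1], target[:, 1:]\n",
"    return np.column_stack(forecasts)\n",
"\n",
"toy_rng = np.random.default_rng(0)\n",
"toy_walks = 10 + np.cumsum(toy_rng.normal(size=(50, 12)), axis=1)\n",
"\n",
"# One row per series, one column per step ahead\n",
"recursive_forecast_sketch(toy_walks, horizon=3).shape"
]
},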
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"t+1 - SMAPE: 1.6249\n", | |
"t+2 - SMAPE: 1.6964\n", | |
"t+3 - SMAPE: 1.7736\n", | |
"t+4 - SMAPE: 1.8577\n", | |
"t+5 - SMAPE: 1.9731\n", | |
"t+6 - SMAPE: 2.1009\n", | |
"t+7 - SMAPE: 2.2476\n", | |
"t+8 - SMAPE: 2.4071\n" | |
] | |
} | |
], | |
"source": [ | |
"print('\\n'.join(perf))" | |
] | |
}, | |
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"t+1 - SMAPE: 1.6759\n", | |
"t+2 - SMAPE: 1.7387\n", | |
"t+3 - SMAPE: 1.8112\n", | |
"t+4 - SMAPE: 1.9043\n", | |
"t+5 - SMAPE: 2.0242\n", | |
"t+6 - SMAPE: 2.1616\n", | |
"t+7 - SMAPE: 2.3316\n", | |
"t+8 - SMAPE: 2.5400\n", | |
"\n", | |
"Public score: 1.2322" | |
] | |
}, | |
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Postprocessing" | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"max_active = ds.groupby('cfips')['active'].max()\n",
"last_known_target_values = ds.query('is_train').groupby('cfips')['target'].last().to_dict()\n",
"\n",
"# Counties that never reach 100 active microbusinesses: carry the last known value forward\n",
"for cfips in max_active[max_active < 100].index:\n",
"    ds.loc[ds.cfips.eq(cfips) & ~ds.is_train, 'target'] = last_known_target_values[cfips]"
] | |
}, | |
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Visual checks" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 48, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"application/vnd.plotly.v1+json": { | |
"config": { | |
"plotlyServerURL": "https://plot.ly" | |
}, | |
"data": [ | |
{ | |
"hovertemplate": "variable=target<br>row_id=%{x}<br>value=%{y}<extra></extra>", | |
"legendgroup": "target", | |
"marker": { | |
"color": "#636efa", | |
"symbol": "circle" | |
}, | |
"mode": "markers", | |
"name": "target", | |
"orientation": "v", | |
"showlegend": true, | |
"type": "scatter", | |
"x": [ | |
"2019-08-01T00:00:00", | |
"2019-09-01T00:00:00", | |
"2019-10-01T00:00:00", | |
"2019-11-01T00:00:00", | |
"2019-12-01T00:00:00", | |
"2020-01-01T00:00:00", | |
"2020-02-01T00:00:00", | |
"2020-03-01T00:00:00", | |
"2020-04-01T00:00:00", | |
"2020-05-01T00:00:00", | |
"2020-06-01T00:00:00", | |
"2020-07-01T00:00:00", | |
"2020-08-01T00:00:00", | |
"2020-09-01T00:00:00", | |
"2020-10-01T00:00:00", | |
"2020-11-01T00:00:00", | |
"2020-12-01T00:00:00", | |
"2021-01-01T00:00:00", | |
"2021-02-01T00:00:00", | |
"2021-03-01T00:00:00", | |
"2021-04-01T00:00:00", | |
"2021-05-01T00:00:00", | |
"2021-06-01T00:00:00", | |
"2021-07-01T00:00:00", | |
"2021-08-01T00:00:00", | |
"2021-09-01T00:00:00", | |
"2021-10-01T00:00:00", | |
"2021-11-01T00:00:00", | |
"2021-12-01T00:00:00", | |
"2022-01-01T00:00:00", | |
"2022-02-01T00:00:00", | |
"2022-03-01T00:00:00", | |
"2022-04-01T00:00:00", | |
"2022-05-01T00:00:00", | |
"2022-06-01T00:00:00", | |
"2022-07-01T00:00:00", | |
"2022-08-01T00:00:00", | |
"2022-09-01T00:00:00", | |
"2022-10-01T00:00:00", | |
"2022-11-01T00:00:00", | |
"2022-12-01T00:00:00", | |
"2023-01-01T00:00:00", | |
"2023-02-01T00:00:00", | |
"2023-03-01T00:00:00", | |
"2023-04-01T00:00:00", | |
"2023-05-01T00:00:00", | |
"2023-06-01T00:00:00" | |
], | |
"xaxis": "x", | |
"y": [ | |
1.1863674899999999, | |
1.198351, | |
1.2462851, | |
1.3229796, | |
1.3229796, | |
1.3318003, | |
1.3605442, | |
1.3605442, | |
1.4084507, | |
1.3892881, | |
1.4180321, | |
1.4467759, | |
1.4467759, | |
1.4371946, | |
1.4371946, | |
1.4371946, | |
1.3605442, | |
1.3746035, | |
1.1823512, | |
1.1535134, | |
1.1535134, | |
1.1535134, | |
1.1727387, | |
1.1439008, | |
1.1535134, | |
1.1342882, | |
1.1439008, | |
1.1535134, | |
1.163126, | |
1.1787628, | |
1.1592791, | |
1.1495372, | |
1.1300536, | |
1.1203117, | |
1.1203117, | |
1.1592791, | |
1.1592791, | |
1.1787628, | |
1.1690209, | |
1.1742170243799857, | |
1.1798589608463734, | |
1.1777659730942878, | |
1.1801574873643013, | |
1.1773674079570413, | |
1.1800574683838128, | |
1.1798943150950045, | |
1.175360447568515 | |
], | |
"yaxis": "y" | |
}, | |
{ | |
"hovertemplate": "variable=oof<br>row_id=%{x}<br>value=%{y}<extra></extra>", | |
"legendgroup": "oof", | |
"marker": { | |
"color": "#EF553B", | |
"symbol": "circle" | |
}, | |
"mode": "markers", | |
"name": "oof", | |
"orientation": "v", | |
"showlegend": true, | |
"type": "scatter", | |
"x": [ | |
"2019-08-01T00:00:00", | |
"2019-09-01T00:00:00", | |
"2019-10-01T00:00:00", | |
"2019-11-01T00:00:00", | |
"2019-12-01T00:00:00", | |
"2020-01-01T00:00:00", | |
"2020-02-01T00:00:00", | |
"2020-03-01T00:00:00", | |
"2020-04-01T00:00:00", | |
"2020-05-01T00:00:00", | |
"2020-06-01T00:00:00", | |
"2020-07-01T00:00:00", | |
"2020-08-01T00:00:00", | |
"2020-09-01T00:00:00", | |
"2020-10-01T00:00:00", | |
"2020-11-01T00:00:00", | |
"2020-12-01T00:00:00", | |
"2021-01-01T00:00:00", | |
"2021-02-01T00:00:00", | |
"2021-03-01T00:00:00", | |
"2021-04-01T00:00:00", | |
"2021-05-01T00:00:00", | |
"2021-06-01T00:00:00", | |
"2021-07-01T00:00:00", | |
"2021-08-01T00:00:00", | |
"2021-09-01T00:00:00", | |
"2021-10-01T00:00:00", | |
"2021-11-01T00:00:00", | |
"2021-12-01T00:00:00", | |
"2022-01-01T00:00:00", | |
"2022-02-01T00:00:00", | |
"2022-03-01T00:00:00", | |
"2022-04-01T00:00:00", | |
"2022-05-01T00:00:00", | |
"2022-06-01T00:00:00", | |
"2022-07-01T00:00:00", | |
"2022-08-01T00:00:00", | |
"2022-09-01T00:00:00", | |
"2022-10-01T00:00:00", | |
"2022-11-01T00:00:00", | |
"2022-12-01T00:00:00", | |
"2023-01-01T00:00:00", | |
"2023-02-01T00:00:00", | |
"2023-03-01T00:00:00", | |
"2023-04-01T00:00:00", | |
"2023-05-01T00:00:00", | |
"2023-06-01T00:00:00" | |
], | |
"xaxis": "x", | |
"y": [ | |
null, | |
1.1959122463923015, | |
1.201825426840018, | |
1.2516713420314654, | |
1.3256473439159162, | |
1.3360062943039246, | |
1.317595650169866, | |
1.3658255229837737, | |
1.3661545595411593, | |
1.4002722963705525, | |
1.3872547624875398, | |
1.415241594726766, | |
1.4538018058794624, | |
1.475609697508276, | |
1.4303118638345949, | |
1.4277306780456036, | |
1.4248567903792644, | |
1.3607091219844947, | |
1.3486925751617207, | |
1.188812633121493, | |
1.1652911294581, | |
1.1609494986362252, | |
1.149864238919837, | |
1.184660282606063, | |
1.136839376822894, | |
1.154783244521418, | |
1.1397259733701888, | |
1.1397259733701888, | |
1.1550087765827737, | |
1.1666870403075489, | |
1.1854843433482365, | |
1.1657697783942198, | |
1.140725970288753, | |
1.1399989867314626, | |
1.128819292018842, | |
1.1269752474025307, | |
1.1554346005123686, | |
1.16180678136804, | |
1.1843054053899484, | |
null, | |
null, | |
null, | |
null, | |
null, | |
null, | |
null, | |
null | |
], | |
"yaxis": "y" | |
} | |
], | |
"layout": { | |
"legend": { | |
"title": { | |
"text": "variable" | |
}, | |
"tracegroupgap": 0 | |
}, | |
"shapes": [ | |
{ | |
"fillcolor": "yellow", | |
"opacity": 0.2, | |
"type": "rect", | |
"x0": "2022-11-01T00:00:00", | |
"x1": "2023-06-01T00:00:00", | |
"xref": "x", | |
"y0": 0, | |
"y1": 1, | |
"yref": "y domain" | |
} | |
], | |
"template": { | |
"data": { | |
"bar": [ | |
{ | |
"error_x": { | |
"color": "#2a3f5f" | |
}, | |
"error_y": { | |
"color": "#2a3f5f" | |
}, | |
"marker": { | |
"line": { | |
"color": "#E5ECF6", | |
"width": 0.5 | |
}, | |
"pattern": { | |
"fillmode": "overlay", | |
"size": 10, | |
"solidity": 0.2 | |
} | |
}, | |
"type": "bar" | |
} | |
], | |
"barpolar": [ | |
{ | |
"marker": { | |
"line": { | |
"color": "#E5ECF6", | |
"width": 0.5 | |
}, | |
"pattern": { | |
"fillmode": "overlay", | |
"size": 10, | |
"solidity": 0.2 | |
} | |
}, | |
"type": "barpolar" | |
} | |
], | |
"carpet": [ | |
{ | |
"aaxis": { | |
"endlinecolor": "#2a3f5f", | |
"gridcolor": "white", | |
"linecolor": "white", | |
"minorgridcolor": "white", | |
"startlinecolor": "#2a3f5f" | |
}, | |
"baxis": { | |
"endlinecolor": "#2a3f5f", | |
"gridcolor": "white", | |
"linecolor": "white", | |
"minorgridcolor": "white", | |
"startlinecolor": "#2a3f5f" | |
}, | |
"type": "carpet" | |
} | |
], | |
"choropleth": [ | |
{ | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
}, | |
"type": "choropleth" | |
} | |
], | |
"contour": [ | |
{ | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
}, | |
"colorscale": [ | |
[ | |
0, | |
"#0d0887" | |
], | |
[ | |
0.1111111111111111, | |
"#46039f" | |
], | |
[ | |
0.2222222222222222, | |
"#7201a8" | |
], | |
[ | |
0.3333333333333333, | |
"#9c179e" | |
], | |
[ | |
0.4444444444444444, | |
"#bd3786" | |
], | |
[ | |
0.5555555555555556, | |
"#d8576b" | |
], | |
[ | |
0.6666666666666666, | |
"#ed7953" | |
], | |
[ | |
0.7777777777777778, | |
"#fb9f3a" | |
], | |
[ | |
0.8888888888888888, | |
"#fdca26" | |
], | |
[ | |
1, | |
"#f0f921" | |
] | |
], | |
"type": "contour" | |
} | |
], | |
"contourcarpet": [ | |
{ | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
}, | |
"type": "contourcarpet" | |
} | |
], | |
"heatmap": [ | |
{ | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
}, | |
"colorscale": [ | |
[ | |
0, | |
"#0d0887" | |
], | |
[ | |
0.1111111111111111, | |
"#46039f" | |
], | |
[ | |
0.2222222222222222, | |
"#7201a8" | |
], | |
[ | |
0.3333333333333333, | |
"#9c179e" | |
], | |
[ | |
0.4444444444444444, | |
"#bd3786" | |
], | |
[ | |
0.5555555555555556, | |
"#d8576b" | |
], | |
[ | |
0.6666666666666666, | |
"#ed7953" | |
], | |
[ | |
0.7777777777777778, | |
"#fb9f3a" | |
], | |
[ | |
0.8888888888888888, | |
"#fdca26" | |
], | |
[ | |
1, | |
"#f0f921" | |
] | |
], | |
"type": "heatmap" | |
} | |
], | |
"heatmapgl": [ | |
{ | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
}, | |
"colorscale": [ | |
[ | |
0, | |
"#0d0887" | |
], | |
[ | |
0.1111111111111111, | |
"#46039f" | |
], | |
[ | |
0.2222222222222222, | |
"#7201a8" | |
], | |
[ | |
0.3333333333333333, | |
"#9c179e" | |
], | |
[ | |
0.4444444444444444, | |
"#bd3786" | |
], | |
[ | |
0.5555555555555556, | |
"#d8576b" | |
], | |
[ | |
0.6666666666666666, | |
"#ed7953" | |
], | |
[ | |
0.7777777777777778, | |
"#fb9f3a" | |
], | |
[ | |
0.8888888888888888, | |
"#fdca26" | |
], | |
[ | |
1, | |
"#f0f921" | |
] | |
], | |
"type": "heatmapgl" | |
} | |
], | |
"histogram": [ | |
{ | |
"marker": { | |
"pattern": { | |
"fillmode": "overlay", | |
"size": 10, | |
"solidity": 0.2 | |
} | |
}, | |
"type": "histogram" | |
} | |
], | |
"histogram2d": [ | |
{ | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
}, | |
"colorscale": [ | |
[ | |
0, | |
"#0d0887" | |
], | |
[ | |
0.1111111111111111, | |
"#46039f" | |
], | |
[ | |
0.2222222222222222, | |
"#7201a8" | |
], | |
[ | |
0.3333333333333333, | |
"#9c179e" | |
], | |
[ | |
0.4444444444444444, | |
"#bd3786" | |
], | |
[ | |
0.5555555555555556, | |
"#d8576b" | |
], | |
[ | |
0.6666666666666666, | |
"#ed7953" | |
], | |
[ | |
0.7777777777777778, | |
"#fb9f3a" | |
], | |
[ | |
0.8888888888888888, | |
"#fdca26" | |
], | |
[ | |
1, | |
"#f0f921" | |
] | |
], | |
"type": "histogram2d" | |
} | |
], | |
"histogram2dcontour": [ | |
{ | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
}, | |
"colorscale": [ | |
[ | |
0, | |
"#0d0887" | |
], | |
[ | |
0.1111111111111111, | |
"#46039f" | |
], | |
[ | |
0.2222222222222222, | |
"#7201a8" | |
], | |
[ | |
0.3333333333333333, | |
"#9c179e" | |
], | |
[ | |
0.4444444444444444, | |
"#bd3786" | |
], | |
[ | |
0.5555555555555556, | |
"#d8576b" | |
], | |
[ | |
0.6666666666666666, | |
"#ed7953" | |
], | |
[ | |
0.7777777777777778, | |
"#fb9f3a" | |
], | |
[ | |
0.8888888888888888, | |
"#fdca26" | |
], | |
[ | |
1, | |
"#f0f921" | |
] | |
], | |
"type": "histogram2dcontour" | |
} | |
], | |
"mesh3d": [ | |
{ | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
}, | |
"type": "mesh3d" | |
} | |
], | |
"parcoords": [ | |
{ | |
"line": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"type": "parcoords" | |
} | |
], | |
"pie": [ | |
{ | |
"automargin": true, | |
"type": "pie" | |
} | |
], | |
"scatter": [ | |
{ | |
"fillpattern": { | |
"fillmode": "overlay", | |
"size": 10, | |
"solidity": 0.2 | |
}, | |
"type": "scatter" | |
} | |
], | |
"scatter3d": [ | |
{ | |
"line": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"marker": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"type": "scatter3d" | |
} | |
], | |
"scattercarpet": [ | |
{ | |
"marker": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"type": "scattercarpet" | |
} | |
], | |
"scattergeo": [ | |
{ | |
"marker": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"type": "scattergeo" | |
} | |
], | |
"scattergl": [ | |
{ | |
"marker": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"type": "scattergl" | |
} | |
], | |
"scattermapbox": [ | |
{ | |
"marker": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"type": "scattermapbox" | |
} | |
], | |
"scatterpolar": [ | |
{ | |
"marker": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"type": "scatterpolar" | |
} | |
], | |
"scatterpolargl": [ | |
{ | |
"marker": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"type": "scatterpolargl" | |
} | |
], | |
"scatterternary": [ | |
{ | |
"marker": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"type": "scatterternary" | |
} | |
], | |
"surface": [ | |
{ | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
}, | |
"colorscale": [ | |
[ | |
0, | |
"#0d0887" | |
], | |
[ | |
0.1111111111111111, | |
"#46039f" | |
], | |
[ | |
0.2222222222222222, | |
"#7201a8" | |
], | |
[ | |
0.3333333333333333, | |
"#9c179e" | |
], | |
[ | |
0.4444444444444444, | |
"#bd3786" | |
], | |
[ | |
0.5555555555555556, | |
"#d8576b" | |
], | |
[ | |
0.6666666666666666, | |
"#ed7953" | |
], | |
[ | |
0.7777777777777778, | |
"#fb9f3a" | |
], | |
[ | |
0.8888888888888888, | |
"#fdca26" | |
], | |
[ | |
1, | |
"#f0f921" | |
] | |
], | |
"type": "surface" | |
} | |
], | |
"table": [ | |
{ | |
"cells": { | |
"fill": { | |
"color": "#EBF0F8" | |
}, | |
"line": { | |
"color": "white" | |
} | |
}, | |
"header": { | |
"fill": { | |
"color": "#C8D4E3" | |
}, | |
"line": { | |
"color": "white" | |
} | |
}, | |
"type": "table" | |
} | |
] | |
}, | |
"layout": { | |
"annotationdefaults": { | |
"arrowcolor": "#2a3f5f", | |
"arrowhead": 0, | |
"arrowwidth": 1 | |
}, | |
"autotypenumbers": "strict", | |
"coloraxis": { | |
"colorbar": { | |
"outlinewidth": 0, | |
"ticks": "" | |
} | |
}, | |
"colorscale": { | |
"diverging": [ | |
[ | |
0, | |
"#8e0152" | |
], | |
[ | |
0.1, | |
"#c51b7d" | |
], | |
[ | |
0.2, | |
"#de77ae" | |
], | |
[ | |
0.3, | |
"#f1b6da" | |
], | |
[ | |
0.4, | |
"#fde0ef" | |
], | |
[ | |
0.5, | |
"#f7f7f7" | |
], | |
[ | |
0.6, | |
"#e6f5d0" | |
], | |
[ | |
0.7, | |
"#b8e186" | |
], | |
[ | |
0.8, | |
"#7fbc41" | |
], | |
[ | |
0.9, | |
"#4d9221" | |
], | |
[ | |
1, | |
"#276419" | |
] | |
], | |
"sequential": [ | |
[ | |
0, | |
"#0d0887" | |
], | |
[ | |
0.1111111111111111, | |
"#46039f" | |
], | |
[ | |
0.2222222222222222, | |
"#7201a8" | |
], | |
[ | |
0.3333333333333333, | |
"#9c179e" | |
], | |
[ | |
0.4444444444444444, | |
"#bd3786" | |
], | |
[ | |
0.5555555555555556, | |
"#d8576b" | |
], | |
[ | |
0.6666666666666666, | |
"#ed7953" | |
], | |
[ | |
0.7777777777777778, | |
"#fb9f3a" | |
], | |
[ | |
0.8888888888888888, | |
"#fdca26" | |
], | |
[ | |
1, | |
"#f0f921" | |
] | |
], | |
"sequentialminus": [ | |
[ | |
0, | |
"#0d0887" | |
], | |
[ | |
0.1111111111111111, | |
"#46039f" | |
], | |
[ | |
0.2222222222222222, | |
"#7201a8" | |
], | |
[ | |
0.3333333333333333, | |
"#9c179e" | |
], | |
[ | |
0.4444444444444444, | |
"#bd3786" | |
], | |
[ | |
0.5555555555555556, | |
"#d8576b" | |
], | |
[ | |
0.6666666666666666, | |
"#ed7953" | |
], | |
[ | |
0.7777777777777778, | |
"#fb9f3a" | |
], | |
[ | |
0.8888888888888888, | |
"#fdca26" | |
], | |
[ | |
1, | |
"#f0f921" | |
] | |
] | |
}, | |
"colorway": [ | |
"#636efa", | |
"#EF553B", | |
"#00cc96", | |
"#ab63fa", | |
"#FFA15A", | |
"#19d3f3", | |
"#FF6692", | |
"#B6E880", | |
"#FF97FF", | |
"#FECB52" | |
], | |
"font": { | |
"color": "#2a3f5f" | |
}, | |
"geo": { | |
"bgcolor": "white", | |
"lakecolor": "white", | |
"landcolor": "#E5ECF6", | |
"showlakes": true, | |
"showland": true, | |
"subunitcolor": "white" | |
}, | |
"hoverlabel": { | |
"align": "left" | |
}, | |
"hovermode": "closest", | |
"mapbox": { | |
"style": "light" | |
}, | |
"paper_bgcolor": "white", | |
"plot_bgcolor": "#E5ECF6", | |
"polar": { | |
"angularaxis": { | |
"gridcolor": "white", | |
"linecolor": "white", | |
"ticks": "" | |
}, | |
"bgcolor": "#E5ECF6", | |
"radialaxis": { | |
"gridcolor": "white", | |
"linecolor": "white", | |
"ticks": "" | |
} | |
}, | |
"scene": { | |
"xaxis": { | |
"backgroundcolor": "#E5ECF6", | |
"gridcolor": "white", | |
"gridwidth": 2, | |
"linecolor": "white", | |
"showbackground": true, | |
"ticks": "", | |
"zerolinecolor": "white" | |
}, | |
"yaxis": { | |
"backgroundcolor": "#E5ECF6", | |
"gridcolor": "white", | |
"gridwidth": 2, | |
"linecolor": "white", | |
"showbackground": true, | |
"ticks": "", | |
"zerolinecolor": "white" | |
}, | |
"zaxis": { | |
"backgroundcolor": "#E5ECF6", | |
"gridcolor": "white", | |
"gridwidth": 2, | |
"linecolor": "white", | |
"showbackground": true, | |
"ticks": "", | |
"zerolinecolor": "white" | |
} | |
}, | |
"shapedefaults": { | |
"line": { | |
"color": "#2a3f5f" | |
} | |
}, | |
"ternary": { | |
"aaxis": { | |
"gridcolor": "white", | |
"linecolor": "white", | |
"ticks": "" | |
}, | |
"baxis": { | |
"gridcolor": "white", | |
"linecolor": "white", | |
"ticks": "" | |
}, | |
"bgcolor": "#E5ECF6", | |
"caxis": { | |
"gridcolor": "white", | |
"linecolor": "white", | |
"ticks": "" | |
} | |
}, | |
"title": { | |
"x": 0.05 | |
}, | |
"xaxis": { | |
"automargin": true, | |
"gridcolor": "white", | |
"linecolor": "white", | |
"ticks": "", | |
"title": { | |
"standoff": 15 | |
}, | |
"zerolinecolor": "white", | |
"zerolinewidth": 2 | |
}, | |
"yaxis": { | |
"automargin": true, | |
"gridcolor": "white", | |
"linecolor": "white", | |
"ticks": "", | |
"title": { | |
"standoff": 15 | |
}, | |
"zerolinecolor": "white", | |
"zerolinewidth": 2 | |
} | |
} | |
}, | |
"title": { | |
"text": "17061" | |
}, | |
"xaxis": { | |
"anchor": "y", | |
"domain": [ | |
0, | |
1 | |
], | |
"title": { | |
"text": "row_id" | |
} | |
}, | |
"yaxis": { | |
"anchor": "x", | |
"domain": [ | |
0, | |
1 | |
], | |
"title": { | |
"text": "value" | |
} | |
} | |
} | |
} | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"# Pick a random county and compare its target values with column 1 of the out-of-fold predictions.\n", | |
"cfips = ds.query('is_train').cfips.sample().unique()[0]\n", | |
"\n", | |
"series = ds.query('cfips == @cfips')['target'].to_frame('target')\n", | |
"series['oof'] = all_oof[all_oof.index.map(lambda x: x.split('_')[0] == str(cfips))][1]\n", | |
"series.index = pd.to_datetime(series.index.map(lambda x: x.split('_')[1]))\n", | |
"\n", | |
"# With the plotly plotting backend, plot() returns a Figure; shade the forecast months in yellow.\n", | |
"fig = series.plot(kind='scatter', title=str(cfips))\n", | |
"fig = fig.add_vrect(x0=cut_off + one_month, x1=cut_off + 8 * one_month, fillcolor='yellow', opacity=0.2)\n", | |
"fig" | |
] | |
}, | |
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Submission" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 33, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>row_id</th>\n", | |
" <th>microbusiness_density</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1001_2022-11-01</td>\n", | |
" <td>3.817671</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1003_2022-11-01</td>\n", | |
" <td>3.817671</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1005_2022-11-01</td>\n", | |
" <td>3.817671</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1007_2022-11-01</td>\n", | |
" <td>3.817671</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1009_2022-11-01</td>\n", | |
" <td>3.817671</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" row_id microbusiness_density\n", | |
"0 1001_2022-11-01 3.817671\n", | |
"1 1003_2022-11-01 3.817671\n", | |
"2 1005_2022-11-01 3.817671\n", | |
"3 1007_2022-11-01 3.817671\n", | |
"4 1009_2022-11-01 3.817671" | |
] | |
}, | |
"execution_count": 33, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"sample_sub = pd.read_csv('data/sample_submission.csv')\n", | |
"sample_sub.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 34, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>row_id</th>\n", | |
" <th>microbusiness_density</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1001_2022-11-01</td>\n", | |
" <td>3.465837</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1003_2022-11-01</td>\n", | |
" <td>8.371675</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1005_2022-11-01</td>\n", | |
" <td>1.229685</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1007_2022-11-01</td>\n", | |
" <td>1.297983</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1009_2022-11-01</td>\n", | |
" <td>1.837880</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" row_id microbusiness_density\n", | |
"0 1001_2022-11-01 3.465837\n", | |
"1 1003_2022-11-01 8.371675\n", | |
"2 1005_2022-11-01 1.229685\n", | |
"3 1007_2022-11-01 1.297983\n", | |
"4 1009_2022-11-01 1.837880" | |
] | |
}, | |
"execution_count": 34, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"sub = ds[~ds.is_train]['target'].rename('microbusiness_density').reset_index()\n", | |
"sub.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 35, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"assert len(sub) == len(sample_sub)\n", | |
"assert sub.row_id.equals(sample_sub.row_id)\n", | |
"assert not sub.microbusiness_density.isnull().any()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 36, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# pandas infers zip compression from the '.zip' file extension.\n", | |
"sub.to_csv('submission.csv.zip', index=False)" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.11.0" | |
}, | |
"orig_nbformat": 4, | |
"vscode": { | |
"interpreter": { | |
"hash": "55fbbcf542e06cc59ad76a1e0d5dc36ee204d6d2b704491656ee6b3487310122" | |
} | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |