@josef-pkt
Created November 22, 2019 04:38
example for vif and multicollinearity measures
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multicollinearity and variance inflation factor\n",
"\n",
"This notebook illustrates new multicollinearity measures in statsmodels."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import statsmodels.stats.multicollinearity as smmc\n",
"import statsmodels.stats.outliers_influence as smoi\n",
"from statsmodels.tools.tools import add_constant\n",
"import warnings\n",
"warnings.simplefilter('ignore', category=FutureWarning)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First we generate some data where two variables are highly correlated. The last two variables have correlation close to 1.\n",
"All variables are correlated with the last variable, but conditional on the last variable, all remaining random variables are uncorrelated from each other."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def _get_data(nobs=100, k_vars=4):\n",
" np.random.seed(987536)\n",
" rho_coeff = np.linspace(0.3, 0.99, k_vars - 1)\n",
" x = np.random.randn(nobs, k_vars - 1) * (1 - rho_coeff)\n",
" z = np.random.randn(nobs, 1)\n",
" x += rho_coeff * z\n",
" return np.column_stack((x, z))\n",
"\n",
"x = _get_data()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1. , 0.3927787 , 0.37594755, 0.3766416 ],\n",
" [0.3927787 , 1. , 0.88239163, 0.88267024],\n",
" [0.37594755, 0.88239163, 1. , 0.99995756],\n",
" [0.3766416 , 0.88267024, 0.99995756, 1. ]])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.corrcoef(x, rowvar=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the multicollinearity class with data that does not include a constant and default setting. This will demean and standardize the data by default."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"mc = smmc.MultiCollinearity(x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As expected the last two variables have a very large variance inflation factor around 11900."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1.19440749e+00, 4.62517166e+00, 1.18814501e+04, 1.19102450e+04])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mc.vif"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note, that the variance inflation factor for non-singular data is just a one-liner using numpy."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1.19440749e+00, 4.62517166e+00, 1.18814501e+04, 1.19102450e+04])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.diag(np.linalg.inv(np.corrcoef(x, rowvar=0)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can compare the computed vif with those of the existing function in outliers_influence. Because that function is intended for the use with a design matrix, we need to handle the constant by either demeaning the data or by including a constant in the design matrix.\n",
"\n",
"The variance inflation factors are identical (*) in all three ways of computing it.\n",
"\n",
"(*) up to floating point precision of the underlying computation."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1.19440749e+00, 4.62517166e+00, 1.18814501e+04, 1.19102450e+04])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xdm = x - x.mean(0)\n",
"np.array([smoi.variance_inflation_factor(xdm, ii) for ii in range(xdm.shape[1])])"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1.00915697e+00, 1.19440749e+00, 4.62517166e+00, 1.18814501e+04,\n",
" 1.19102450e+04])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xf = add_constant(x)\n",
"np.array([smoi.variance_inflation_factor(xf, ii) for ii in range(xf.shape[1])])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The variance inflation factor is directly related to the partial $R^2$ which is the $R^2$ of a regression of one variable on all others.\n",
"The $R^2$ of the last two variables are close to 1 because of their strong correlation."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.16276479, 0.78379181, 0.99991584, 0.99991604])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mc.rsquared_partial"
]
},
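{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch, the identity $vif_j = 1 / (1 - R^2_j)$ can be evaluated directly from `rsquared_partial`. The agreement with `vif` is close but not exact, because `vif` uses a small default ridge factor (discussed below)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"1. / (1 - mc.rsquared_partial)"
]
},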
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The MultiCollinearity class computes the standardized moment matrix, which is just the correlation matrix of the data. This is attached as attribute `mom`."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1. , 0.3927787 , 0.37594755, 0.3766416 ],\n",
" [0.3927787 , 1. , 0.88239163, 0.88267024],\n",
" [0.37594755, 0.88239163, 1. , 0.99995756],\n",
" [0.3766416 , 0.88267024, 0.99995756, 1. ]])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mc.mom"
]
},
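{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a one-line sanity check, `mom` should match the correlation matrix computed by numpy."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.allclose(mc.mom, np.corrcoef(x, rowvar=0))"
]
},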
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All variables are correlated with each other, however the first three variables are only correlated through their correlation with the last variable.\n",
"\n",
"The partial correlation coefficient is defined as pairwise correlation conditional on all remaining variables.\n",
"\n",
"We see that the first two variables have small or no partial correlation with any variable.\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1. , 0.13477369, -0.07254555, 0.07388222],\n",
" [ 0.13477369, 1. , -0.0452646 , 0.0620937 ],\n",
" [-0.07254555, -0.0452646 , 1. , 0.99980902],\n",
" [ 0.07388222, 0.0620937 , 0.99980902, 1. ]])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mc.corr_partial"
]
},
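{
"cell_type": "markdown",
"metadata": {},
"source": [
"The partial correlation matrix can also be computed directly from the inverse of the correlation matrix, using the standard identity $p_{ij} = -K_{ij} / \\sqrt{K_{ii} K_{jj}}$, where $K$ is the inverse of the correlation matrix. The following minimal sketch should reproduce `corr_partial` up to small numerical differences."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# invert the correlation matrix and rescale to partial correlations\n",
"kinv = np.linalg.inv(np.corrcoef(x, rowvar=0))\n",
"d = np.sqrt(np.diag(kinv))\n",
"pcorr = -kinv / np.outer(d, d)\n",
"np.fill_diagonal(pcorr, 1)\n",
"pcorr"
]
},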
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The multicollinearity class also includes the analogues to the sum of squares in linear regression results. We use the same names, however those statistics are defined in terms of the correlation matrix or standardized data divided by the number of observations. So those are mean squares or variances.\n",
"\n",
"`tss` is the variance of the standardized data which are just 1. \n",
"`rss` is the mean squared error or variance of the residual (without degrees of freedom correction)\n",
"\n",
"These can be used to compute the `rsquared_partial` and the `vif`."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1., 1., 1., 1.])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mc.tss"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([8.37235207e-01, 2.16208191e-01, 8.41648108e-05, 8.39613292e-05])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mc.rss"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1.08024700e-12, 1.61537450e-13, 1.99840144e-14, 1.99840144e-14])"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(1 - mc.rss) - mc.rsquared_partial"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1.19440749e+00, 4.62517166e+00, 1.18814501e+04, 1.19102450e+04])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mc.tss / mc.rss"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The absolute and relative errors between the two ways of computing vif are small but larger than floating point precision."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1.54121160e-12, 3.45590223e-12, 2.82856308e-06, 2.83541158e-06])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mc.tss / mc.rss - mc.vif"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1.29030120e-12, 7.47180096e-13, 2.38065567e-10, 2.38064901e-10])"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mc.tss / mc.rss / mc.vif - 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The computation of the vif attribute uses a default ridge factor to avoid problems with the matrix inverse in singular or close to singular cases. In contrast, `rss` and `tss` are computed without any ridge correction.\n",
"\n",
"If we set the ridge factor to zero, then the agreement is at floating point precision."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.00000000e+00, 2.22044605e-16, 0.00000000e+00, 2.22044605e-16])"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mc.tss / mc.rss / mc.get_vif(ridge_factor=0.) - 1"
]
},
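{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate what the ridge correction does, the following sketch adds a small ridge term to the diagonal of the correlation matrix before inverting. The exact default ridge factor used by `vif` is internal to the class; the value below is only an assumption for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch of a ridge-regularized vif computation;\n",
"# the ridge factor is an assumed value, not necessarily the class default\n",
"corr = np.corrcoef(x, rowvar=0)\n",
"ridge_factor = 1e-13\n",
"np.diag(np.linalg.inv(corr + ridge_factor * np.eye(corr.shape[0])))"
]
},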
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Dropping a multicollinear explanatory variable**\n",
"\n",
"When we drop one of the very highly correlated variables, then the partial correlation will adjust to a different conditioning set of variables. \n",
"In this example, dropping one of the almost collinear variables increases the partial correlation of some remaining variables. For example, the correlation between the second and the last variable was small before and is now 0.86. \n",
"\n",
"This seems to make it difficult to interpret partial correlation because it strongly depends on the conditioning set, i.e. how many and which variables are included in the model."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"mc2 = smmc.MultiCollinearity(x[:, [0, 1, 3]])"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1. , 0.13856419, 0.06928756],\n",
" [0.13856419, 1. , 0.86245392],\n",
" [0.06928756, 0.86245392, 1. ]])"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mc2.corr_partial"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using `vif_selection` to recursively drop multicollinear variables.\n",
"\n",
"`vif_selection` recursively drops variables until only variables with a vif below a threshold are left.\n",
"\n",
"In the following example we use 10 correlated variables with increasing variance inflation factor. 3 variables with the highest vif are dropped. The remaining variables have a vif below the threshold in the reduced model."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[1.42, 1.35, 1.83, 2.43, 5.22, 10.68, 25.97, 99.43, 11641.69, 11761.19]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"k_vars=10\n",
"\n",
"x = _get_data(k_vars=k_vars)\n",
"list(np.round(smmc.MultiCollinearity(x).vif, 2))"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0, 1, 2, 3, 4, 5, 6]"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"keep = smmc.vif_selection(x)[0]\n",
"keep"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1.39451288, 1.23820836, 1.73924112, 2.40351635, 4.56938883,\n",
" 7.23742675, 8.70443214])"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xs = x[:, keep]\n",
"mcs = smmc.MultiCollinearity(xs)\n",
"mcs.vif"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[7, 8, 9]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dropped = [i for i in range(x.shape[1]) if i not in keep]\n",
"dropped"
]
},
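{
"cell_type": "markdown",
"metadata": {},
"source": [
"For intuition, the following is a minimal sketch of the recursive algorithm: compute the vif of all remaining variables, drop the variable with the largest vif if it exceeds the threshold, and repeat. This is only an illustration under an assumed threshold, not the statsmodels implementation; tie-breaking and numerical details may differ."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def vif_selection_sketch(x, threshold=10):\n",
"    # recursively drop the variable with the largest vif until all\n",
"    # remaining vif are at or below the (assumed) threshold\n",
"    keep = list(range(x.shape[1]))\n",
"    while len(keep) > 1:\n",
"        vif = np.diag(np.linalg.inv(np.corrcoef(x[:, keep], rowvar=0)))\n",
"        imax = int(np.argmax(vif))\n",
"        if vif[imax] <= threshold:\n",
"            break\n",
"        del keep[imax]\n",
"    return keep"
]
},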
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want to verify that this does not depend on the variable order by increasing sequence of vif. After random shuffling columns, we still get the same result in terms of the original variables."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[2, 3, 4, 6, 7, 8, 9]"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"idx = np.arange(k_vars)[::-1]\n",
"np.random.seed(987456)\n",
"np.random.shuffle(idx)\n",
"keep_shuffled = smmc.vif_selection(x[:, idx])[0]\n",
"keep_shuffled"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([9, 7, 4, 1, 2, 8, 6, 5, 3, 0])"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"idx"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([4, 1, 2, 6, 5, 3, 0])"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"idx[keep_shuffled]"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 1, 2, 3, 4, 5, 6])"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.sort(idx[keep_shuffled])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`vif_selection` returns the index of variables to keep, but we can find the index of variables that are dropped using a simple list comprehension.\n",
"We see that the variables with the highest vif, index 7, 8, 9, have been dropped by `vif_selection`."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0, 1, 5]"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dropped_shuffled = [i for i in range(x.shape[1]) if i not in keep_shuffled]\n",
"dropped_shuffled"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([7, 8, 9])"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.sort(idx[dropped_shuffled])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we use a pandas DataFrame for the data, then `vif_selection` returns the column names of the variables that we want to keep."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['x4', 'x1', 'x2', 'x6', 'x5', 'x3', 'x0'], dtype='object')"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"columns = ['x%d' % i for i in idx]\n",
"xdf = pd.DataFrame(x[:, idx], columns=columns)\n",
"keep_cols = smmc.vif_selection(xdf)[0]\n",
"keep_cols"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Variable selection in OLS based on vif\n",
"\n",
"The high multicollinearity of the explanatory variables in this example makes most of the parameters statistically insignificant, most confidence intervals include zero. Dropping variables until we have all vif below a threshold of 10 reduces but does not eliminate the effect.\n",
"\n",
"In the full model only one coefficient has a p-value of 0.05, all others are statistically insignificant. However, $R^2$ is 0.67 and the F-test that only the constant is nonzero is strongly rejected. This means that the explanatory variables have jointly some predictive power, but we cannot reliably estimate individual parameters.\n",
"\n",
"Below, I compute wald_tests and t_tests for the sum of coefficient of the highly collinear explanatory variables. For example we can see that the effect of the sum of the variables that have been dropped by `vif_selection`, `'x7 + x8 + x9 + x10'` is large and positive.\n",
"\n",
"Dropping variables based on vif might not be a good way to handle multicollinearity. "
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"from statsmodels.regression.linear_model import OLS"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"nobs = x.shape[0]\n",
"y = 0.1 * x.sum(1) + 0.5 *np.random.randn(nobs)\n",
"exog = add_constant(xdf[keep_cols.sort_values()])"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: y R-squared: 0.674\n",
"Model: OLS Adj. R-squared: 0.649\n",
"Method: Least Squares F-statistic: 27.12\n",
"Date: Thu, 21 Nov 2019 Prob (F-statistic): 7.84e-20\n",
"Time: 23:36:15 Log-Likelihood: -68.297\n",
"No. Observations: 100 AIC: 152.6\n",
"Df Residuals: 92 BIC: 173.4\n",
"Df Model: 7 \n",
"Covariance Type: nonrobust \n",
"==============================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"const -0.0872 0.051 -1.704 0.092 -0.189 0.014\n",
"x0 -0.0725 0.072 -1.002 0.319 -0.216 0.071\n",
"x1 0.2391 0.072 3.343 0.001 0.097 0.381\n",
"x2 0.1807 0.108 1.669 0.098 -0.034 0.396\n",
"x3 -0.0038 0.111 -0.034 0.973 -0.225 0.217\n",
"x4 -0.1242 0.136 -0.911 0.365 -0.395 0.147\n",
"x5 0.2694 0.176 1.534 0.129 -0.079 0.618\n",
"x6 0.5211 0.176 2.958 0.004 0.171 0.871\n",
"==============================================================================\n",
"Omnibus: 0.001 Durbin-Watson: 1.910\n",
"Prob(Omnibus): 0.999 Jarque-Bera (JB): 0.087\n",
"Skew: -0.007 Prob(JB): 0.957\n",
"Kurtosis: 2.856 Cond. No. 7.25\n",
"==============================================================================\n",
"\n",
"Warnings:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
]
}
],
"source": [
"res = OLS(y, exog).fit()\n",
"print(res.summary())"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: y R-squared: 0.700\n",
"Model: OLS Adj. R-squared: 0.667\n",
"Method: Least Squares F-statistic: 20.80\n",
"Date: Thu, 21 Nov 2019 Prob (F-statistic): 2.59e-19\n",
"Time: 23:36:15 Log-Likelihood: -64.009\n",
"No. Observations: 100 AIC: 150.0\n",
"Df Residuals: 89 BIC: 178.7\n",
"Df Model: 10 \n",
"Covariance Type: nonrobust \n",
"==============================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"const -0.1009 0.051 -1.987 0.050 -0.202 4.79e-07\n",
"x1 -0.0627 0.071 -0.881 0.380 -0.204 0.079\n",
"x2 0.2099 0.073 2.882 0.005 0.065 0.355\n",
"x3 0.1185 0.108 1.096 0.276 -0.096 0.333\n",
"x4 -0.0219 0.109 -0.202 0.841 -0.238 0.194\n",
"x5 -0.2237 0.142 -1.576 0.119 -0.506 0.058\n",
"x6 -0.0350 0.208 -0.168 0.867 -0.448 0.378\n",
"x7 -0.0163 0.296 -0.055 0.956 -0.605 0.573\n",
"x8 -0.5543 0.519 -1.069 0.288 -1.585 0.476\n",
"x9 4.4436 5.179 0.858 0.393 -5.846 14.733\n",
"x10 -3.1269 5.153 -0.607 0.546 -13.366 7.113\n",
"==============================================================================\n",
"Omnibus: 3.335 Durbin-Watson: 1.959\n",
"Prob(Omnibus): 0.189 Jarque-Bera (JB): 1.949\n",
"Skew: -0.028 Prob(JB): 0.377\n",
"Kurtosis: 2.318 Cond. No. 349.\n",
"==============================================================================\n",
"\n",
"Warnings:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
]
}
],
"source": [
"resf = OLS(y, add_constant(x)).fit()\n",
"print(resf.summary())"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<class 'statsmodels.stats.contrast.ContrastResults'>\n",
"<F test: F=array([[0.08154128]]), p=0.7758613990742989, df_denom=92, df_num=1>"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res.wald_test('x4 + x5 + x6 = 0.7')"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<class 'statsmodels.stats.contrast.ContrastResults'>\n",
" Test for Constraints \n",
"==============================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"c0 0.6663 0.118 5.652 0.000 0.432 0.900\n",
"=============================================================================="
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res.t_test('x4 + x5 + x6')"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<class 'statsmodels.stats.contrast.ContrastResults'>\n",
" Test for Constraints \n",
"==============================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"c0 0.4655 0.144 3.243 0.002 0.180 0.751\n",
"=============================================================================="
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"resf.t_test('x4 + x5 + x6 + x7 + x8 + x9 + x10')"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<class 'statsmodels.stats.contrast.ContrastResults'>\n",
" Test for Constraints \n",
"==============================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"c0 0.7461 0.210 3.545 0.001 0.328 1.164\n",
"=============================================================================="
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"resf.t_test('x7 + x8 + x9 + x10')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}