## Sunday, May 3

### Using python statsmodels for OLS linear regression

This is a short post about using the python statsmodels package for calculating and charting a linear regression.

Let's start with some dummy data, which we will enter using iPython. We fake up normally distributed data around y ~ x + 10.

In [1]: import numpy as np

In [2]: x = np.random.randn(100)

In [3]: y = x + np.random.randn(100) + 10


We can plot this simply ...

In [4]: import matplotlib.pyplot as plt

In [5]: fig, ax = plt.subplots(figsize=(8, 4))

In [6]: ax.scatter(x, y, alpha=0.5, color='orchid')
Out[6]:

In [7]: fig.suptitle('Example Scatter Plot')
Out[7]:

In [9]: ax.grid(True)

In [10]: fig.savefig('filename1.png', dpi=125)


That was easy. Next we will add a regression line. We will use the statsmodels package to calculate the regression line. Lines 11 to 15 is where we model the regression. Lines 16 to 20 we calculate and plot the regression line.

The key trick is at line 12: we need to add the intercept term explicitly. Without with this step, the regression model would be: y ~ x, rather than y ~ x + c. Similarly, at line 17, we include an intercept term in the data we provide to the predicting method at line 18. The sm.add_constant() method prepends a column of ones for the constant term in the regression model, returning a two column numpy array. The first column is ones, the second column is our original data from above.

In [11]: import statsmodels.api as sm

In [12]: x = sm.add_constant(x) # constant intercept term

In [13]: # Model: y ~ x + c

In [14]: model = sm.OLS(y, x)

In [15]: fitted = model.fit()

In [16]: x_pred = np.linspace(x.min(), x.max(), 50)

In [18]: y_pred = fitted.predict(x_pred2)

In [19]: ax.plot(x_pred, y_pred, '-', color='darkorchid', linewidth=2)
Out[19]: []

In [20]: fig.savefig('filename2.png', dpi=125)


If we wanted key data from the regression, the following would do the job, after line 15:

print(fitted.params)     # the estimated parameters for the regression line
print(fitted.summary())  # summary statistics for the regression


We can add a confidence interval for the regression. There is a 95 per cent probability that the true regression line for the population lies within the confidence interval for our estimate of the regression line calculated from the sample data. We will calculate this from scratch, largely because I am not aware of a simple way of doing it within the statsmodels package.

To get the necessary t-statistic, I have imported the scipy stats package at line 27, and calculated the t-statistic at line 28.

In [22]: y_hat = fitted.predict(x) # x is an array from line 12 above

In [23]: y_err = y - y_hat

In [24]: mean_x = x.T[1].mean()

In [25]: n = len(x)

In [26]: dof = n - fitted.df_model - 1

In [27]: from scipy import stats

In [28]: t = stats.t.ppf(1-0.025, df=dof)

In [29]: s_err = np.sum(np.power(y_err, 2))

In [30]: conf = t * np.sqrt((s_err/(n-2))*(1.0/n + (np.power((x_pred-mean_x),2) /
....:     ((np.sum(np.power(x_pred,2))) - n*(np.power(mean_x,2))))))

In [31]: upper = y_pred + abs(conf)

In [32]: lower = y_pred - abs(conf)

In [33]: ax.fill_between(x_pred, lower, upper, color='#888888', alpha=0.4)
Out[33]:

In [34]: fig.savefig('filename3.png', dpi=125)


The final step is a prediction interval. There is a 95 per cent probability that the real value of y in the population for a given value of x lies within the prediction interval. There is a statsmodels method in the sandbox we can use.

In [35]: from statsmodels.sandbox.regression.predstd import wls_prediction_std

In [36]: sdev, lower, upper = wls_prediction_std(fitted, exog=x_pred2, alpha=0.05)

In [37]: ax.fill_between(x_pred, lower, upper, color='#888888', alpha=0.1)
Out[37]:

In [38]: fig.savefig('filename4.png', dpi=125)