Let's start with some dummy data, which we will enter using iPython. We fake up normally distributed data around y ~ x + 10.
In : import numpy as np In : x = np.random.randn(100) In : y = x + np.random.randn(100) + 10
We can plot this simply ...
In : import matplotlib.pyplot as plt In : fig, ax = plt.subplots(figsize=(8, 4)) In : ax.scatter(x, y, alpha=0.5, color='orchid') Out:
In : fig.suptitle('Example Scatter Plot') Out: In : fig.tight_layout(pad=2); In : ax.grid(True) In : fig.savefig('filename1.png', dpi=125)
That was easy. Next we will add a regression line. We will use the statsmodels package to calculate the regression line. Lines 11 to 15 is where we model the regression. Lines 16 to 20 we calculate and plot the regression line.
The key trick is at line 12: we need to add the intercept term explicitly. Without with this step, the regression model would be: y ~ x, rather than y ~ x + c. Similarly, at line 17, we include an intercept term in the data we provide to the predicting method at line 18. The sm.add_constant() method prepends a column of ones for the constant term in the regression model, returning a two column numpy array. The first column is ones, the second column is our original data from above.
In : import statsmodels.api as sm In : x = sm.add_constant(x) # constant intercept term In : # Model: y ~ x + c In : model = sm.OLS(y, x) In : fitted = model.fit() In : x_pred = np.linspace(x.min(), x.max(), 50) In : x_pred2 = sm.add_constant(x_pred) In : y_pred = fitted.predict(x_pred2) In : ax.plot(x_pred, y_pred, '-', color='darkorchid', linewidth=2) Out: [
] In : fig.savefig('filename2.png', dpi=125)
If we wanted key data from the regression, the following would do the job, after line 15:
print(fitted.params) # the estimated parameters for the regression line print(fitted.summary()) # summary statistics for the regression
We can add a confidence interval for the regression. There is a 95 per cent probability that the true regression line for the population lies within the confidence interval for our estimate of the regression line calculated from the sample data. We will calculate this from scratch, largely because I am not aware of a simple way of doing it within the statsmodels package.
To get the necessary t-statistic, I have imported the scipy stats package at line 27, and calculated the t-statistic at line 28.
In : y_hat = fitted.predict(x) # x is an array from line 12 above In : y_err = y - y_hat In : mean_x = x.T.mean() In : n = len(x) In : dof = n - fitted.df_model - 1 In : from scipy import stats In : t = stats.t.ppf(1-0.025, df=dof) In : s_err = np.sum(np.power(y_err, 2)) In : conf = t * np.sqrt((s_err/(n-2))*(1.0/n + (np.power((x_pred-mean_x),2) / ....: ((np.sum(np.power(x_pred,2))) - n*(np.power(mean_x,2)))))) In : upper = y_pred + abs(conf) In : lower = y_pred - abs(conf) In : ax.fill_between(x_pred, lower, upper, color='#888888', alpha=0.4) Out:
In : fig.savefig('filename3.png', dpi=125)
The final step is a prediction interval. There is a 95 per cent probability that the real value of y in the population for a given value of x lies within the prediction interval. There is a statsmodels method in the sandbox we can use.
In : from statsmodels.sandbox.regression.predstd import wls_prediction_std In : sdev, lower, upper = wls_prediction_std(fitted, exog=x_pred2, alpha=0.05) In : ax.fill_between(x_pred, lower, upper, color='#888888', alpha=0.1) Out:
In : fig.savefig('filename4.png', dpi=125)