Mark the Graph: May 2015

Tuesday, May 12, 2015

Plotting time-series DataFrames in pandas

Pandas provides a convenience method for plotting DataFrames: DataFrame.plot. There is also a quick guide here.

Unfortunately, when it comes to time series data, I don't always find the convenience method convenient. I often have a sparse DataFrame with lots of NaNs, which are not ignored by the convenience method. Additionally, I don't like the way that matplotlib places the lines hard against the left and right-hand sides of the canvas. I like a little bit of space at each end of the chart. Finally, I like playing with the tick marks and tick labels to get the right density of information on the x-axis.

Rather than use the inconvenient convenience method, I regularly find myself writing a short function to produce the plot layout I find a little more aesthetically pleasing. An example chart (from Mark the Ballot) and the associated python code follows.

import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter, YearLocator, MonthLocator
plt.style.use('ggplot')

def plot(df, filename, heading=None):

    fig, ax = plt.subplots(figsize=(8, 4))

    min_date = None
    max_date = None
    for col_name in df.columns.values:

        # plot the column
        col = df[col_name]
        col = col[col.notnull()] # drop NAs
        dates = [zzz.to_timestamp().date() for zzz in col.index]
        ax.plot_date(x=dates, y=col, fmt='-', label=col_name,
            tz=None, xdate=True, ydate=False, linewidth=1.5)

        # establish the date range for the data
        if min_date:
            min_date = min(min_date, min(dates))
        else:
            min_date = min(dates)
        if max_date:
            max_date = max(max_date, max(dates))
        else:
            max_date = max(dates)

    # give a bit of space at each end of the plot - aesthetics
    span = max_date - min_date
    extra = int(span.days * 0.03) * datetime.timedelta(days=1)
    ax.set_xlim([min_date - extra, max_date + extra])

    # format the x tick marks
    ax.xaxis.set_major_formatter(DateFormatter('%Y'))
    ax.xaxis.set_minor_formatter(DateFormatter('\n%b'))
    ax.xaxis.set_major_locator(YearLocator())
    ax.xaxis.set_minor_locator(MonthLocator(bymonthday=1, interval=2))

    # grid, legend and yLabel
    ax.grid(True)
    ax.legend(loc='best', prop={'size':'x-small'})
    ax.set_ylabel('Percent')

    # heading
    if heading:
        fig.suptitle(heading, fontsize=12)
    fig.tight_layout(pad=1.5)

    # footnote
    fig.text(0.99, 0.01, 'marktheballot.blogspot.com.au', ha='right', 
        va='bottom', fontsize=8, color='#999999')

    # save to file
    fig.savefig(filename, dpi=125)

Sunday, May 03, 2015

Using python statsmodels for OLS linear regression

This is a short post about using the python statsmodels package for calculating and charting a linear regression.

Let's start with some dummy data, which we will enter using iPython. We fake up normally distributed data around y ~ x + 10.

In [1]: import numpy as np

In [2]: x = np.random.randn(100)

In [3]: y = x + np.random.randn(100) + 10

We can plot this simply ...

In [4]: import matplotlib.pyplot as plt

In [5]: fig, ax = plt.subplots(figsize=(8, 4))

In [6]: ax.scatter(x, y, alpha=0.5, color='orchid')
Out[6]: 

In [7]: fig.suptitle('Example Scatter Plot')
Out[7]: 

In [8]: fig.tight_layout(pad=2); 

In [9]: ax.grid(True)

In [10]: fig.savefig('filename1.png', dpi=125)

That was easy. Next we will add a regression line. We will use the statsmodels package to calculate the regression line. Lines 11 to 15 is where we model the regression. Lines 16 to 20 we calculate and plot the regression line.

The key trick is at line 12: we need to add the intercept term explicitly. Without with this step, the regression model would be: y ~ x, rather than y ~ x + c. Similarly, at line 17, we include an intercept term in the data we provide to the predicting method at line 18. The sm.add_constant() method prepends a column of ones for the constant term in the regression model, returning a two column numpy array. The first column is ones, the second column is our original data from above.

In [11]: import statsmodels.api as sm

In [12]: x = sm.add_constant(x) # constant intercept term

In [13]: # Model: y ~ x + c

In [14]: model = sm.OLS(y, x)

In [15]: fitted = model.fit()

In [16]: x_pred = np.linspace(x.min(), x.max(), 50)

In [17]: x_pred2 = sm.add_constant(x_pred)

In [18]: y_pred = fitted.predict(x_pred2)

In [19]: ax.plot(x_pred, y_pred, '-', color='darkorchid', linewidth=2)
Out[19]: []

In [20]: fig.savefig('filename2.png', dpi=125)

If we wanted key data from the regression, the following would do the job, after line 15:

print(fitted.params)     # the estimated parameters for the regression line
print(fitted.summary())  # summary statistics for the regression

We can add a confidence interval for the regression. There is a 95 per cent probability that the true regression line for the population lies within the confidence interval for our estimate of the regression line calculated from the sample data. We will calculate this from scratch, largely because I am not aware of a simple way of doing it within the statsmodels package.

To get the necessary t-statistic, I have imported the scipy stats package at line 27, and calculated the t-statistic at line 28.

In [22]: y_hat = fitted.predict(x) # x is an array from line 12 above

In [23]: y_err = y - y_hat

In [24]: mean_x = x.T[1].mean()

In [25]: n = len(x)

In [26]: dof = n - fitted.df_model - 1

In [27]: from scipy import stats

In [28]: t = stats.t.ppf(1-0.025, df=dof)

In [29]: s_err = np.sum(np.power(y_err, 2))

In [30]: conf = t * np.sqrt((s_err/(n-2))*(1.0/n + (np.power((x_pred-mean_x),2) / 
   ....:     ((np.sum(np.power(x_pred,2))) - n*(np.power(mean_x,2))))))

In [31]: upper = y_pred + abs(conf)

In [32]: lower = y_pred - abs(conf)

In [33]: ax.fill_between(x_pred, lower, upper, color='#888888', alpha=0.4)
Out[33]: 

In [34]: fig.savefig('filename3.png', dpi=125)

The final step is a prediction interval. There is a 95 per cent probability that the real value of y in the population for a given value of x lies within the prediction interval. There is a statsmodels method in the sandbox we can use.

In [35]: from statsmodels.sandbox.regression.predstd import wls_prediction_std

In [36]: sdev, lower, upper = wls_prediction_std(fitted, exog=x_pred2, alpha=0.05)

In [37]: ax.fill_between(x_pred, lower, upper, color='#888888', alpha=0.1)
Out[37]: 

In [38]: fig.savefig('filename4.png', dpi=125)