Sunday, December 28

Updated cheat sheets: Python, pandas and matplotlib

I have updated my cheat sheets for Python, pandas and matplotlib.

You can find them here: bit.ly/python_cs

Sunday, August 3

Pandas 0.14.1

I have just upgraded to Pandas 0.14.1.

It was a pain. At first, none of my graphics programs worked. It looks like a change to the API for parsing Microsoft Excel files was the problem. I am not sure whether my previous approach was wrong (but worked serendipitously), or whether the API was deliberately changed in a way that breaks old code (an unusual practice for a point release). If someone knows, I'd appreciate a note in the comments below.
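
For what it is worth, the working and non-working calls are contrasted below (a minimal sketch; the file and sheet names are placeholders, and my reading of the change -- that header is now counted after the skipped rows -- is only a guess):

import pandas as pd

wb = pd.ExcelFile('example.xls') # placeholder file name

# what I used to do (worked before the upgrade):
#df = wb.parse(sheetname='Data1', skiprows=8, header=9, index_col=0)

# what works under pandas 0.14.1:
df = wb.parse(sheetname='Data1', skiprows=9, header=0, index_col=0)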

What follows are the classes I use to load Australian Bureau of Statistics (ABS) and Reserve Bank of Australia (RBA) data, with the changes to the parsing step commented.

And yes, I use Python 2.7, not Python 3 (it's what comes with the Apple Mac).

### ABSExcelLoader.py
### written in python 2.7 and pandas 0.14.1

import pandas as pd
from distutils.version import LooseVersion
assert LooseVersion(pd.__version__) >= LooseVersion('0.14.1') # a plain string comparison can mis-order versions

class ABSExcelLoader:

    def load(self, pathName, freq='M', index_name=None, verbose=False):
        """return a dictionary of pandas DataFrame objects for
           each Data work sheet in an ABS Excel spreadsheet"""

        wb = pd.ExcelFile(pathName)
        returnDict = {}

        for name in wb.sheet_names:
            if 'Data' not in name:
                continue

            # ExcelFile.parse: API behaviour change with pandas 0.14.1
            #df = wb.parse(sheetname=name, skiprows=8, header=9, index_col=0, na_values=['', '-', ' '])
            df = wb.parse(sheetname=name, skiprows=9, header=0, index_col=0, na_values=['', '-', ' '])

            periods = pd.PeriodIndex(pd.Series(df.index), freq=freq)
            df.set_index(keys=periods, inplace=True)
            df.index.name = index_name
            returnDict[name] = df

            if verbose:
                print ("\nFile: '{}', sheet: '{}'".format(pathName, name))
                print (df.iloc[:min(5, len(df)), :min(5, len(df.columns))])

        return returnDict
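
Usage is straightforward. A hypothetical example (the file name and frequency are placeholders for whatever ABS time-series workbook you have downloaded; ABS workbooks name their data sheets 'Data1', 'Data2', and so on):

loader = ABSExcelLoader()
frames = loader.load('5206001.xls', freq='Q', index_name='Quarter', verbose=True)
df = frames['Data1']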

### RBAExcelLoader.py
### written in python 2.7 and pandas 0.14.1

import pandas as pd
from distutils.version import LooseVersion
assert LooseVersion(pd.__version__) >= LooseVersion('0.14.1') # a plain string comparison can mis-order versions

class RBAExcelLoader:

    def load(self, pathName, freq='M', index_name=None, verbose=False):
        """return a pandas DataFrame for an RBA Excel spreadsheet"""
        wb = pd.ExcelFile(pathName)
        sheetname = 'Data'

        # ExcelFile.parse: API behaviour change with pandas 0.14.1
        #df = wb.parse(sheetname, skiprows=9, header=10, index_col=0, na_values=['', '-', ' '])
        df = wb.parse(sheetname, skiprows=10, header=0, index_col=0, na_values=['', '-', ' '])

        periods = pd.PeriodIndex(pd.Series(df.index), freq=freq)
        df.set_index(keys=periods, inplace=True)

        if verbose:
            print "\nFile: '{}', sheet: '{}'".format(pathName, sheetname)
            print 'Columns: {}'.format(df.columns.tolist())
            print 'Top left hand corner ...'
            print '------------------------'
            print df.iloc[:min(5, len(df)), :min(5, len(df.columns))]

        return df
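
And similarly for the RBA loader (again, the file name is a placeholder for a locally saved RBA statistical table):

loader = RBAExcelLoader()
df = loader.load('g01hist.xls', freq='Q', verbose=True)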

Tuesday, July 15

Seasonal adjustment and decomposition

Having moved from R to python/pandas, one of the things I really miss is the extensive statistics library available in R.

I particularly like the STL function: the seasonal decomposition of a time series by loess (local regression), which breaks a series into its seasonal, trend and irregular components. Unfortunately, I could find nothing like it in pandas.

In frustration, I have cobbled together a python/pandas function to decompose a univariate pandas time series into seasonal, trend and irregular components. The function does not use loess; rather, it uses a set of processes similar to those used by the Australian Bureau of Statistics.

It also uses a Henderson moving average, which I had coded previously (see here).

Anyway, I'd be interested to know whether this code works for you. Drop me a line if you end up using it.

Caution: still experimental code.

## Decompose.py
## produce a time series decomposition

import pandas as pd
import numpy as np
from Henderson import Henderson


# --- A selection of seasonal smoothing weights, from which you can select
#     Note: these are end-weights; they are reversed for the start of a series
#     Note: your own weights in this form should also work
s3x3 = (
    np.array([ 5, 11, 11]) / 27.0,
    np.array([ 3,  7, 10,  7]) / 27.0,
    np.array([ 1,  2,  3,  2,  1]) / 9.0,
)

s3x5 = (
    np.array([ 9, 17, 17, 17]) / 60.0,
    np.array([ 4, 11, 15, 15, 15]) / 60.0,
    np.array([ 4,  8, 13, 13, 13,  9]) / 60.0,
    np.array([ 1,  2,  3,  3,  3,  2,  1]) / 15.0,
)

s3x9 = (
    np.array([0.051, 0.112, 0.173, 0.197, 0.221, 0.246]),
    np.array([0.028, 0.092, 0.144, 0.160, 0.176, 0.192, 0.208]),
    np.array([0.032, 0.079, 0.123, 0.133, 0.143, 0.154, 0.163, 0.173]),
    np.array([0.034, 0.075, 0.113, 0.117, 0.123, 0.128, 0.132, 0.137, 0.141]),
    np.array([0.034, 0.073, 0.111, 0.113, 0.114, 0.116, 0.117, 0.118, 0.120, 0.084]),
    np.array([1, 2, 3, 3, 3, 3, 3, 3, 3, 2, 1]) / 27.0,
)
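
# --- a quick optional check: each weight array above should sum to
#     (roughly) one, so that smoothing preserves the level of the series
for smoother in (s3x3, s3x5, s3x9):
    for weights in smoother:
        assert abs(weights.sum() - 1.0) < 0.01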


# --- public Decomposition function
def Decompose(s, periods=None, model='multiplicative', 
    constantSeasonal=False, seasonalSmoother=s3x5):

    """ The simple decomposition of a pandas Series s into its trend, seasonal 
        and irregular components. The default is a multiplicative model: 
        Original(t) = Trend(t) * Seasonal(t) * Irregular(t). Can specify an 
        additive model: Original(t) = Trend(t) + Seasonal(t) + Irregular(t)

        Parameters:
        -   s - the pandas Series, without any missing or NA values, 
            and sorted in ascending order
        -   periods - either a pandas Series indicating the period to 
            which each value of s belongs (of the same length as s, 
            with the same index as s), or an int for the number of periods
            into which to decompose the series
        -   model - string - either 'multiplicative' or 'additive'
        -   constantSeasonal - bool - whether the seasonal component is 
            constant or (slowly) variable
        -   seasonalSmoother - when not using a constantSeasonal, which 
            of the seasonal smoothers to use (s3x3, s3x5 or s3x9) - 
            default is s3x5 (ie over 7 years for monthly or quarterly data)

        Returns a pandas DataFrame with columns for each step in the 
        decomposition process (enables debugging). The key columns in the 
        DataFrame are:
        -   'Original' - the original series
        -   'SeasAdj' - the seasonally adjusted series
        -   'Trend' - the trend of the seasonally adjusted series
        -   'Seasonal' - the seasonal component found through the 
            decomposition process
        -   'Irregular' - the irregular component found through the 
            decomposition process

        Notes:
        Based on ideas gleaned from the Australian Bureau of Statistics:
            ABS (2005), "An Introductory Course on Times Series
            Analysis -- Electronic Delivery", Catalogue: 1346.0.55.001, online at:
            http://www.ausstats.abs.gov.au/ausstats/free.nsf/0/C2714BAD1DD803E6CA256F960072B0C0/$File/1346055001_jan%202005.pdf
        Does not adjust for moving holidays, public holidays, variable number 
        of working days in month, etc. (ie. it is quite a simple decomposition)"""

    ### --- Sanity checks and initialisation --- ###

    # --- sanity checks
    if periods is None:
        raise ValueError('The periods parameter must be specified (an int or a pandas Series)')
    if not isinstance(s, pd.Series):
        raise TypeError('The s parameter should be a pandas Series')
    if not (s.index.is_monotonic and s.index.is_unique):
        raise ValueError('The index for the s parameter should be unique and sorted')
    if any(s.isnull()) or not all(np.isfinite(s)):
        raise ValueError('The s parameter contains NA or infinite values')

    # --- initialise
    result = pd.DataFrame(s)
    result.columns = ['Original']

    # --- determine the period
    if isinstance(periods, pd.Series):
        if not (len(s) == len(periods) and all(s.index == periods.index)):
            raise ValueError('The s and periods parameters must have the same index')
        result['period'] = periods
        periods = len(periods.unique())
    else:
        periods = int(periods)
        result['period'] = pd.Series(range(len(result)), index=s.index) % periods
    if periods < 2:
        raise ValueError('The periods parameter should be >= 2')
    if len(s) < (periods * 2) + 1:
        raise ValueError('The s parameter is not long enough to decompose')

    # --- settle the length of the Henderson moving average
    h = max(periods, 7) # ABS uses 13-term HMA for monthly and 7-term for quarterly
    if h % 2 == 0 :
        h += 1 # we need an odd number

    ### --- On to the decomposition process --- ###

    # --- 1 - derive an initial estimate for the trend component
    result['1stTrendEst'] = pd.rolling_mean(s, window=periods+1, 
        min_periods=periods+1, center=True)
    # Note: rolling mean leaves NA values at the start/end of the trend estimate.

    # --- 2 - preliminary estimate of the seasonal component
    if model == 'multiplicative':
        result['1stSeasonalEst'] = result['Original'] / result['1stTrendEst']
    else:
        result['1stSeasonalEst'] = result['Original'] - result['1stTrendEst']

    # --- 3 - smooth the seasonal
    result = _smoothSeasonalComponent(result, periods=periods, 
        constantSeasonal=constantSeasonal, seasonalSmoother=seasonalSmoother, 
        columnToBeSmoothed='1stSeasonalEst', newColumn='2ndSeasonalEst')

    # --- 4 - extend the smoothed seasonal estimate to full scale
    if any(result['2ndSeasonalEst'].isnull()) :
        result = _extendSeries(result, periods=periods,
            columnToBeExtended='2ndSeasonalEst', newColumn='3rdSeasonalEst')
    else:
        result['3rdSeasonalEst'] = result['2ndSeasonalEst']

    # --- 5 - preliminary estimate of the seasonally adjusted data
    if model == 'multiplicative':
        result['1stSeasAdjEst'] = result['Original'] / result['3rdSeasonalEst']
    else:
        result['1stSeasAdjEst'] = result['Original'] - result['3rdSeasonalEst']

    # --- 6 - a better estimate of the trend
    result['2ndTrendEst'] =  Henderson(result['1stSeasAdjEst'], h)

    # --- 7 - final estimate of the seasonal component
    if model == 'multiplicative':
        result['4thSeasonalEst'] = result['Original'] / result['2ndTrendEst']
    else:
        result['4thSeasonalEst'] = result['Original'] - result['2ndTrendEst']

    result = _smoothSeasonalComponent(result, periods=periods, 
        constantSeasonal=constantSeasonal, seasonalSmoother=seasonalSmoother, 
        columnToBeSmoothed='4thSeasonalEst', newColumn='Seasonal')

    # --- 8 - final estimate of the seasonally adjusted series
    if model == 'multiplicative':
        result['SeasAdj'] = result['Original'] / result['Seasonal']
    else:
        result['SeasAdj'] = result['Original'] - result['Seasonal']

    # --- 9 - final trend estimate
    result['Trend'] =  Henderson(result['SeasAdj'], h)

    # --- 10 - final irregular
    if model == 'multiplicative':
        result['Irregular'] = result['SeasAdj'] / result['Trend']
    else:
        result['Irregular'] = result['SeasAdj'] - result['Trend']

    # --- 11 - our job here is done
    return (result)


# --- apply seasonal smoother
def _smoothSeasonalComponent(result, periods, constantSeasonal, seasonalSmoother,
    columnToBeSmoothed, newColumn):

    # get the key smoothing constants
    if not constantSeasonal:
        kS = len(seasonalSmoother)
        lenS = (len(seasonalSmoother) * 2) - 1
        centralS = seasonalSmoother[-1] # the central (symmetric) weights

    # establish an empty return column ...
    result[newColumn] = np.repeat(np.nan, len(result))

    # populate the return column ...
    for u in result['period'].unique() :

        # get each of the seasonals
        thisSeason = result[result['period'] == u][columnToBeSmoothed]

        # smooth to a constant seasonal value
        if constantSeasonal:
            thisSeasonSmoothed = pd.Series(np.repeat(thisSeason.mean(skipna=True),
                len(thisSeason)), index=thisSeason.index)

        # smooth to a slowly changing seasonal value
        else:
            # drop NA values which result from step 1 in the decomp process
            thisSeason = thisSeason.dropna()

            # apply the seasonalSmoother
            thisSeasonSmoothed = pd.rolling_apply(thisSeason, window=lenS,
                func=lambda x: (x * centralS).sum(), min_periods=lenS, center=True)

            # for short series the above process results in no data ... find a simple mean
            if all(thisSeasonSmoothed.isnull()) :
                # same treatment as constant seasonal value above
                thisSeasonSmoothed = pd.Series(np.repeat(thisSeason.mean(skipna=True),
                    len(thisSeason)), index=thisSeason.index)

            # handle the end-point problem: apply the asymmetric end-weights
            # (reversed for the start of the series) ...
            for i in range(kS-1) :
                if np.isnan(thisSeasonSmoothed.iat[i]) :
                    thisSeasonSmoothed.iat[i] = (thisSeason.iloc[0: i+kS] *
                        (seasonalSmoother[i][::-1])).sum() # note: reversed order at start

            for i in range(len(thisSeason)-1, len(thisSeason)-kS, -1) :
                if np.isnan(thisSeasonSmoothed.iat[i]) :
                    thisSeasonSmoothed.iat[i] = (
                        thisSeason.iloc[i-(kS-1):len(thisSeason)] *
                        seasonalSmoother[len(thisSeason)-1-i]).sum()

        # package up season by season ...
        result[newColumn] = result[newColumn].where(result[newColumn].notnull(), 
            other=thisSeasonSmoothed)

    return (result)


# --- extend seasonal components to the full length of series 
def _extendSeries(result, periods, columnToBeExtended, newColumn):

    result[newColumn] = result[columnToBeExtended].copy()

    def fillup(result, fill, startPoint, endPoint):
        i = startPoint
        while True:
            p = result.index[i]
            result[newColumn].iat[i] = fill[newColumn].at[result['period'].iat[i]]
            if p >= endPoint:
                break
            i += 1

    # back-cast
    if np.isnan(result.iat[0, result.columns.get_loc(newColumn)]):
        fill = pd.DataFrame(result[newColumn].dropna().iloc[0:periods])
        fill['period'] = result['period'][fill.index[0]:fill.index[len(fill)-1]]
        endPoint = fill.index[0] - 1
        fill.index = fill['period']
        fillup(result=result, fill=fill, startPoint=0, endPoint=endPoint)

    # forward-cast
    if np.isnan(result.iat[len(result)-1, result.columns.get_loc(newColumn)]):
        fill = result[newColumn].dropna()
        fill = pd.DataFrame(fill[len(fill)-periods:len(fill)])
        fill['period'] = result['period'][fill.index[0]:fill.index[len(fill)-1]]
        startPoint = result.index.get_loc(fill.index[len(fill)-1] + 1)
        fill.index = fill['period']
        endPoint = result.index[len(result)-1]
        fillup(result=result, fill=fill, startPoint=startPoint, endPoint=endPoint)

    return (result)
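
Here is a minimal usage sketch with fabricated monthly data (the synthetic series and all parameter choices are made up purely for illustration):

## test Decompose.py with synthetic data
import numpy as np
import pandas as pd
from Decompose import Decompose

n = 60 # five years of monthly data
index = pd.period_range('2009-01', periods=n, freq='M')
trend = pd.Series(np.linspace(100.0, 130.0, n), index=index)
seasonal = pd.Series(np.tile([0.95, 0.97, 1.02, 1.00, 1.03, 1.05,
    1.04, 1.00, 0.98, 0.99, 1.00, 0.97], n // 12), index=index)
noise = pd.Series(1.0 + np.random.randn(n) * 0.005, index=index)
s = trend * seasonal * noise

result = Decompose(s, periods=12, model='multiplicative')
print (result[['Original', 'SeasAdj', 'Trend', 'Seasonal', 'Irregular']].head())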

Monday, June 23

Henderson Moving Average

I have posted my R code for a Henderson moving average here. This is the same code in Python.

## Henderson.py
## calculate a Henderson moving average

import pandas as pd
import numpy as np

def hmaSymmetricWeights(n):
    """ derive an n-term array of symmetric 'Henderson Moving Average' weights
        formula from ABS (2003), 'A Guide to Interpreting Time Series', page 41.
        returns a numpy array of symmetric Henderson weights indexed from 0 to n-1"""

    # calculate the constant denominator and terms
    m = int((n-1)//2) # the mid point - n must be odd
    m1 = (m+1)*(m+1)
    m2 = (m+2)*(m+2)
    d = float(8*(m+2)*(m2-1)*(4*m2-1)*(4*m2-9)*(4*m2-25))
    m3 = (m+3)*(m+3)
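
    # the ABS formula for the symmetric weights, in the notation used below
    # (j runs from -m to m, so w is indexed by m+j):
    #   w[m+j] = 315*((m+1)^2 - j^2)*((m+2)^2 - j^2)*((m+3)^2 - j^2)
    #               *(3*(m+2)^2 - 11*j^2 - 16) / d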

    # calculate the weights
    w = np.repeat(np.nan, n) # Actually indexed from 0 to n-1
    for j in range(m+1):
        j2 = j*j
        v = (315*(m1-j2)*(m2-j2)*(m3-j2)*(3*m2-11*j2-16))/d
        w[(m+j)] = v
        if j > 0:
            w[(m-j)] = v

    w.flags.writeable = False # let's make it quasi-immutable
    return (w)


def hmaAsymmetricWeights(m, w):
    """calculate the asymmetric end-weights

        w --> an array of symmetrical henderson weights (from above function)
        m --> the number of asymmetric weights sought; where m < len(w);

        returns a numpy array of asymmetrical weights, indexed from 0 to m-1;

        formula from Mike Doherty (2001), 'The Surrogate Henderson Filters in X-11',
        Aust, NZ J of Stat. 43(4), 2001, pp901-999; see formula (1) on page 903"""
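
    # Doherty's formula (1), in the indexing used below (the paper indexes
    # the weights 1..n; the arrays here run 0..n-1):
    #   u[r] = w[r] + (1/m)*sum(w[m:n])
    #          + [((r+1) - (m+1)/2) * b2s2 / (1 + (m*(m-1)*(m+1)/12)*b2s2)]
    #            * sum over i in m+1..n of (i - (m+1)/2)*w[i-1]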

    n = len(w) # the number of weights

    # - some quick sanity checks
    if m >= n:
        raise ValueError('The m argument must be less than n')
    if m <= int((n-1)//2):
        raise ValueError('The m argument must be greater than (n-1)/2')

    # --- let's build up Doherty's formula (1) from the top of page 903

    # - the second chunk of the formula
    sumResidual = w[m:n].sum() / float(m)

    # - the last chunk of the formula
    sumEnd = 0.0
    for i in range(m+1, n+1):
        sumEnd += (float(i)-((m+1.0)/2.0)) * w[i-1] # w indexed from 0 to n-1

    # - the beta squared / sigma squared - formula at the bottom of page 904
    ic = 1.0
    if 13 <= n < 15:
        ic = 3.5
    elif n >= 15:
        ic = 4.5
    b2s2 = (4.0/np.pi)/(ic*ic)

    # - the gnarly bit in the middle of the formula
    denominator = 1.0 + ((m*(m-1.0)*(m+1.0) / 12.0 ) * b2s2)
    u = np.repeat(np.nan, m) # return series - created empty
    for r in range(m): # r ranges 0 to m-1; but the formula assumes 1 to m
        numerator = ((r+1.0) - (m+1.0)/2.0) * b2s2
        # - finally putting it all together
        u[r] = w[r] + sumResidual + ( numerator / denominator ) * sumEnd

    u.flags.writeable = False # let's make it quasi-immutable
    return (u)


def Henderson(s, n):
    """ Calculate an n-term Henderson Moving Average for the Series s
        Note: we blithely assume s is ordered, contiguous and without missing data"""

    # - some simple sanity checks
    if not isinstance(s, pd.Series):
        raise TypeError('The s argument should be a pandas Series')
    if not isinstance(n, int):
        raise TypeError('The n argument must be an integer')
    if n < 5:
        raise ValueError('The n argument must be >= 5')
    if n % 2 == 0:
        raise ValueError('The n argument must be odd')
    if len(s) < n:
        raise ValueError('The s argument should be a Series longer than n')

    # - calculate the symmetric weights
    w = hmaSymmetricWeights(n)

    # preliminaries
    r = pd.Series(np.repeat(np.nan, len(s)), index=s.index) # the empty return vehicle
    m = int((n-1)//2)
    l = len(s)

    # - and now move over the length of the series ...
    for i in range(len(s)) :
        if i < m:
            # --- head section of series
            u = hmaAsymmetricWeights(m+i+1, w)[::-1] # reverse - asymmetric to the left
            r.iloc[i] = (s.iloc[0:(i+m+1)] * u).sum()
        elif i + m >= l:
            # --- tail section of series
            u = hmaAsymmetricWeights(m+l-i, w)
            r.iloc[i] = (s.iloc[(i-m):l] * u).sum()
        else:
            # --- middle section of series
            r.iloc[i] = (s.iloc[(i-m):(i+m+1)] * w).sum()

    return (r)


### - test code
#--------------
# Check against Table 1 in B Quenneville and B Lefrancois (2001),
# "Implicit Forecasts in Musgrave Asymmetric Averages",
# Proceedings of the Annual Meeting of the American Statistical Association,
# August 5-9, 2001.
#--------------
#w = hmaSymmetricWeights(9)
#print(w)
#print(w.sum()) # should be one
#u = hmaAsymmetricWeights(7, w)
#print(u)
#print(u.sum()) # should be one
#--------------
#print (Henderson(pd.Series(range(30))+pd.Series(np.random.randn(30)), 13))

Sunday, January 5

Pandas DataFrame cheat sheet and the Python v R debate

This has taken a lot longer than I thought it would. But I now have a rough, early draft of a cheat sheet on the Python pandas DataFrame object. With such a rich environment, deciding what to include and exclude in a four-page set of notes is challenging. As always, comments and suggestions for improvement are welcome. Doubtless there are many typographic errors that another set of eyes and a ruthless proofreading of the draft text would eradicate.

I have also been thinking about the Python versus R debate. Which is the better tool for data analysis? Clearly, this is one of those questions to which there is no ultimate truth. Your answer will depend on what sort of analysis you do. My analytical interest is primarily data capture and visualization. While I do some simple statistical analyses, I do not use R in a deep statistical way. I do, however, depend heavily on Hadley Wickham's excellent graphics package ggplot2.

Before coming to the question, I should set out my relative experience with the two languages. I have been using R as my primary analytical language for a number of years. Over the same period I have also used Python, albeit less frequently, and largely for data munging (loading data from various sources into a MySQL server, from which R can access it).

On the criterion of data munging, Python beats R hands down. Python is simply better able to manage dirty data in all its forms and is better for automating the cleaning and loading operations.

Another area where I think Python beats R is coding time. My impression is that I code at least twice as fast in Python, and I spend much less time debugging. I became a very defensive coder in R: every function I wrote religiously tested its arguments to check that the values were of the right type and within the expected ranges. It was not unusual to start a function with 6 to 12 stopifnot() statements. Even so, I simply code faster in Python, for a few reasons. First, while both languages are very expressive (compared with C, C++ or Java), I find Python the more expressive of the two: list comprehensions and generator expressions are powerful tools for tight code, and while environments in R come close, they are nowhere near as natural to use as dictionaries in Python. Second, Python's much stronger typing better protects me from my poor typing skills (no pun intended). Third, my learning curve with Python was much shorter than for R. But again, this may just be a product of my background (as someone coming from the C, C++, Objective-C and Java programming paradigms).
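
To illustrate the expressiveness point with a trivial, contrived example (toy data, nothing more):

from collections import Counter
sentences = ['the cat sat', 'the dog sat']
counts = Counter(word for s in sentences for word in s.split())
print (counts) # e.g. Counter({'the': 2, 'sat': 2, 'cat': 1, 'dog': 1})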

On graphics, I think Hadley Wickham's ggplot2 beats the competition from Python in the form of matplotlib. But work is afoot to replicate ggplot2 in the Python environment. When that work is well progressed, I might just change ships.

Another area where R leads is idiomatic coherence. While the pandas DataFrame object is an immensely rich environment, it feels cobbled together and a little rough at the edges (it does not feel coherently designed from the ground up). Take the myriad indexing options: [], .loc[], .iloc[], .ix[], .iat[], .at[] and .xs(), which feel like a maze of twisty little passages, all alike, but each a little different. And then there are the confusing rules (for example, a single index on a DataFrame returns a column, but a single sliced index returns rows). Furthermore, pandas did not maintain the Pythonic notion of container truthiness (elsewhere in Python, empty containers are False and non-empty containers are True, regardless of what they contain). I could go on. But, simply put, the data.frame in R coheres with the rest of the R language far better than the DataFrame object coheres with the rest of Python.
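
A quick sketch of the indexing and truthiness quirks I mean:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])
print (df['a'])  # a single label selects a column
print (df[0:2])  # but a slice selects rows
# and bool(df) raises a ValueError, rather than following
# Python's usual rule that non-empty containers are True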

And another point of comparison in R's favour: the R help system is much more helpful than its counterpart from Python and pandas.

Finally, there is something of the nanny state that annoys the hell out of me in both Hadley Wickham's ggplot2 and Wes McKinney's DataFrame. Hadley won't let you plot a chart with two y-axes. Wes won't let you plot a time series as a bar chart. I can see the arguments for both protections. But really, is prohibition needed? Ironically, you can commit Hadley's unforgivable sin with Wes' DataFrame. And Hadley will happily let you plot a time series as a bar chart. It is time both Hadley and Wes embraced liberalism.

Wednesday, January 1

Reading ABS Excel files in pandas

Experimental code follows for reading Excel files from the Australian Bureau of Statistics.

### ABSExcelLoader.py
### written in python 2.7 and pandas 0.12

import pandas as pd

class ABSExcelLoader:

    def load(self, pathName, freq='M', index_name=None, verbose=False):
        """return a dictionary of pandas DataFrame objects for
           each Data work sheet in an ABS Excel spreadsheet"""

        wb = pd.ExcelFile(pathName)
        returnDict = {}

        for name in wb.sheet_names:
            if 'Data' not in name:
                continue

            df = wb.parse(name, skiprows=8, header=9, index_col=0)
            periods = pd.PeriodIndex(pd.Series(df.index), freq=freq)
            df.set_index(keys=periods, inplace=True)
            df.index.name = index_name
            returnDict[name] = df

            if verbose:
                print ("\nFile: '{}', sheet: '{}'".format(pathName, name))
                print (df.iloc[:min(5, len(df)), :min(5, len(df.columns))])

        return returnDict