I particularly like the STL function: the seasonal decomposition of a time series by Loess (localised regression). The function allows you to decompose a time series into its seasonal, trend and irregular components using loess. Unfortunately, I could find nothing like it in pandas.

In frustration, I have cobbled together a python/pandas function to decompose a univariate pandas time series item into seasonal, trend and irregular components. The function does not use Loess. Rather it uses a set of processes similar to those used by the Australian Bureau of Statistics.

It also uses a Henderson Moving Average which I had coded previously (see here).

Anyway, I'd be interested to know whether this code works for you. Drop me a line if you end up using it.

Caution: still experimental code.

## Decompose.py ## produce a time series decomposition import pandas as pd import numpy as np from Henderson import Henderson # --- A selection of seasonal smoothing weights, from which you can select # Note: these are end-weights, they are reversed for the start of a series # Note: your own weights in this form should also work s3x3 = ( np.array([ 5, 11, 11]) / 27.0, np.array([ 3, 7, 10, 7]) / 27.0, np.array([ 1, 2, 3, 2, 1]) / 9.0, ) s3x5 = ( np.array([ 9, 17, 17, 17]) / 60.0, np.array([ 4, 11, 15, 15, 15]) / 60.0, np.array([ 4, 8, 13, 13, 13, 9]) / 60.0, np.array([ 1, 2, 3, 3, 3, 2, 1]) / 15.0, ) s3x9 = ( np.array([0.051, 0.112, 0.173, 0.197, 0.221, 0.246]), np.array([0.028, 0.092, 0.144, 0.160, 0.176, 0.192, 0.208]), np.array([0.032, 0.079, 0.123, 0.133, 0.143, 0.154, 0.163, 0.173]), np.array([0.034, 0.075, 0.113, 0.117, 0.123, 0.128, 0.132, 0.137, 0.141]), np.array([0.034, 0.073, 0.111, 0.113, 0.114, 0.116, 0.117, 0.118, 0.120, 0.084]), np.array([1, 2, 3, 3, 3, 3, 3, 3, 3, 2, 1]) / 27.0, ) # --- public Decomposition function def Decompose(s, periods=None, model='multiplicative', constantSeasonal=False, seasonalSmoother=s3x5): """ The simple decomposition of a pandas Series s into its trend, seasonal and irregular components. The default is a multiplicative model: Original(t) = Trend(t) * Seasonal(t) * Irregular(t). Can specify an additive model: Original(t) = Trend(t) + Seasonal(t) + Irregular(t) Parameters: - s - the pandas Series, without any missing or NA values, and sorted in ascending order - periods - either a pandas Series indicating the period to which each value of s belongs (of the same length as s, with the same index as s), or an int for the number of periods into which to decompose the series - model - string - either 'multiplicative' or 'additive' - constantSeasonal - bool - whether the seasonal component is constant or (slowly) variable - seasonalSmoother - when not using a constantSeasonal, which of the seasonal smoothers to use (s3x3, s3x5 or s3x9) - default is s3x5 (ie over 7 years for monthly or quarterly data) Returns a pandas DataFrame with columns for each step in the decomposition process (enables debugging). The key columns in the DataFrame are: - 'Original' - the original series - 'SeasAdj' - the seasonally adjusted series - 'Trend' - the trend of the seasonally adjusted series - 'Seasonal' - the seasonal component found through the decomposition process - 'Irregular' - the irregular component found through the decomposition process Notes: Based on ideas gleaned from the Australian Bureau of Statistics: ABS (2005), "An Introductory Course on Times Series Analysis -- Electronic Delivery", Catalogue: 1346,0.55.001, online at: http://www.ausstats.abs.gov.au/ausstats/free.nsf/0/C2714BAD1DD803E6CA256F960072B0C0/$File/1346055001_jan%202005.pdf Does not adjust for moving holidays, public holidays, variable number of working days in month, etc. (ie. it is quite a simple decomposition)""" ### --- Sanity checks and initialisation --- ### # --- sanity checks if periods is None: raise ValueError('The periods parameter is an integer or a Series of integers') if not isinstance(s, pd.core.frame.Series): raise TypeError('The s parameter should be a pandas Series') if not(s.index.is_monotonic and s.index.is_unique): raise ValueError('The index for the s parameter should be unique and sorted') if any(s.isnull()) or not all(np.isfinite(s)): raise ValueError('The s parameter contains NA or infinite values') # --- initialise result = pd.DataFrame(s) result.columns = ['Original'] # --- determine the period if isinstance(periods, pd.core.frame.Series): if not (len(s) == len(periods) and all(s.index == periods.index)) : raise ValueError('The s and periods parameters must have the same index') result['period'] = periods periods = len(periods.unique()) else: periods = int(periods) result['period'] = pd.Series(range(len(result)), index=s.index) % periods if periods < 2: raise ValueError('The periods parameter should be >= 2') if len(s) < (periods * 2) + 1: raise ValueError('The s parameter is not long enough to decompose') # --- settle the length of the Henderson moving average h = max(periods, 7) # ABS uses 13-term HMA for monthly and 7-term for quarterly if h % 2 == 0 : h += 1 # we need an odd number ### --- On to the decomposition process --- ### # --- 1 - derive an initial estimate for the trend component result['1stTrendEst'] = pd.rolling_mean(s, window=periods+1, min_periods=periods+1, center=True) # Note: rolling mean leaves NA values at the start/end of the trend estimate. # --- 2 - preliminary estimate of the seasonal component if model == 'multiplicative': result['1stSeasonalEst'] = result['Original'] / result['1stTrendEst'] else: result['1stSeasonalEst'] = result['Original'] - result['1stTrendEst'] # --- 3 - smooth the seasonal result = _smoothSeasonalComponent(result, periods=periods, constantSeasonal=constantSeasonal, seasonalSmoother=seasonalSmoother, columnToBeSmoothed='1stSeasonalEst', newColumn='2ndSeasonalEst') # --- 4 - extend the smoothed seasonal estimate to full scale if any(result['2ndSeasonalEst'].isnull()) : result = _extendSeries(result, periods=periods, columnToBeExtended='2ndSeasonalEst', newColumn='3rdSeasonalEst') else: result['3rdSeasonalEst'] = result['2ndSeasonalEst'] # --- 5 - preliminary estimate of the seasonally adjusted data if model == 'multiplicative': result['1stSeasAdjEst'] = result['Original'] / result['3rdSeasonalEst'] else: result['1stSeasAdjEst'] = result['Original'] - result['3rdSeasonalEst'] # --- 6 - a better estimate of the trend result['2ndTrendEst'] = Henderson(result['1stSeasAdjEst'], h) # --- 7 - final estimate of the seasonal component if model == 'multiplicative': result['4thSeasonalEst'] = result['Original'] / result['2ndTrendEst'] else: result['4thSeasonalEst'] = result['Original'] - result['2ndTrendEst'] result = _smoothSeasonalComponent(result, periods=periods, constantSeasonal=constantSeasonal, seasonalSmoother=seasonalSmoother, columnToBeSmoothed='4thSeasonalEst', newColumn='Seasonal') # --- 8 - final estimate of the seasonally adjusted series if model == 'multiplicative': result['SeasAdj'] = result['Original'] / result['Seasonal'] else: result['SeasAdj'] = result['Original'] - result['Seasonal'] # --- 9 - final trend estimate result['Trend'] = Henderson(result['SeasAdj'], h) # --- 10 - final irregular if model == 'multiplicative': result['Irregular'] = result['SeasAdj'] / result['Trend'] else: result['Irregular'] = result['SeasAdj'] - result['Trend'] # --- 11 - our job here is done return (result) # --- apply seasonal smoother def _smoothSeasonalComponent(result, periods, constantSeasonal, seasonalSmoother, columnToBeSmoothed, newColumn): # get the key smoothing constants if not constantSeasonal: kS = len(seasonalSmoother) lenS = (len(seasonalSmoother) * 2) -1 centralS = seasonalSmoother[len(seasonalSmoother)-1] # establish an empty return column ... result[newColumn] = np.repeat(np.nan, len(result)) # populate the return column ... for u in result['period'].unique() : # get each of of the seasonals thisSeason = result[result['period'] == u][columnToBeSmoothed] # smooth to a constant seasonal value if constantSeasonal: thisSeasonSmoothed = pd.Series(np.repeat(thisSeason.mean(skipna=True), len(thisSeason)), index=thisSeason.index) # smooth to a slowly changing seasonal value else: # drop NA values which result from step 1 in the decomp process thisSeason = thisSeason.dropna() # apply the seasonalSmoother thisSeasonSmoothed = pd.rolling_apply(thisSeason, window=lenS, func=lambda x: (x * centralS).sum(), min_periods=lenS, center=True) # for short series the above process results in no data ... find a simple mean if all(thisSeasonSmoothed.isnull()) : # same treatment as constant seasonal value above thisSeasonSmoothed = pd.Series(np.repeat(thisSeason.mean(skipna=True), len(thisSeason)), index=thisSeason.index) # handle the end-point problem ... for i in range(kS-1) : if np.isnan(thisSeasonSmoothed.iat[i]) : thisSeasonSmoothed.iat[i] = (thisSeason.iloc[0: i+kS] * (seasonalSmoother[i][::-1])).sum() # note: reversed order at start for i in range(len(thisSeason)-1, len(thisSeason)-kS, -1) : if np.isnan(thisSeasonSmoothed.iat[i]) : thisSeasonSmoothed.iat[i] = ( thisSeason.iloc[i-(kS-1):len(thisSeason)] * seasonalSmoother[len(thisSeason)-1-i]).sum() # package up season by season ... result[newColumn] = result[newColumn].where(result[newColumn].notnull(), other=thisSeasonSmoothed) return (result) # --- extend seasonal components to the full length of series def _extendSeries(result, periods, columnToBeExtended, newColumn): result[newColumn] = result[columnToBeExtended].copy() def fillup(result, fill, startPoint, endPoint): i = startPoint while True: p = result.index[i] result[newColumn].iat[i] = fill[newColumn].at[result['period'].iat[i]] if p >= endPoint: break i += 1 # back-cast if np.isnan(result.iat[0, result.columns.get_loc(newColumn)]): fill = pd.DataFrame(result[newColumn].dropna().iloc[0:periods]) fill['period'] = result['period'][fill.index[0]:fill.index[len(fill)-1]] endPoint = fill.index[0] - 1 fill.index = fill['period'] fillup(result=result, fill=fill, startPoint=0, endPoint=endPoint) # forward-cast if np.isnan(result.iat[len(result)-1, result.columns.get_loc(newColumn)]): fill = result[newColumn].dropna() fill = pd.DataFrame(fill[len(fill)-periods:len(fill)]) fill['period'] = result['period'][fill.index[0]:fill.index[len(fill)-1]] startPoint = result.index.get_loc(fill.index[len(fill)-1] + 1) fill.index = fill['period'] endPoint = result.index[len(result)-1] fillup(result=result, fill=fill, startPoint=startPoint, endPoint=endPoint) return (result)

Not my area - but some of your posts have given me some insight for python, so:

ReplyDeleteDid you look at Andreas Hiboll's python wrapper for the R's STL, which also uses Pandas (https://gist.github.com/andreas-h/7808564)

Or his python only version (https://github.com/andreas-h/pyloess)

Thanks for the heads-up - I was not aware of these.

Delete