I have updated my cheat sheets for Python, pandas and matplotlib.
You can find them here: bit.ly/python_cs
Sunday, August 3
Pandas 0.14.1
I have just upgraded to Pandas 0.14.1.
It was a pain. At first, none of my graphics programs worked. The culprit appears to have been a change to the API for parsing Microsoft Excel files. I am not sure whether my previous approach was wrong (but worked serendipitously), or whether the API was deliberately changed in a way that breaks old code (an unusual practice for a point release). If someone knows, I'd appreciate a note in the comments below.
What follows are the classes I use to load Australian Bureau of Statistics (ABS) and Reserve Bank of Australia (RBA) data, with the changes to the parsing step commented.
And yes, I use Python 2.7, not Python 3 (it's what comes with the Apple Mac).
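To make the change concrete, here is the before-and-after for a typical Data sheet. This is only my reading of the new behaviour, not a confirmed account, and the file name is made up: it looks as though header is now counted from the first row remaining after the skipped rows, rather than from the top of the sheet.

# my reading of the 0.14.1 change (unconfirmed): header now seems to be
# counted from the first row remaining after skiprows, not from the
# top of the sheet
wb = pd.ExcelFile('5206001.xls')  # hypothetical file name

# what worked for me under earlier pandas versions:
#df = wb.parse('Data1', skiprows=8, header=9, index_col=0)

# what works for me under pandas 0.14.1:
df = wb.parse('Data1', skiprows=9, header=0, index_col=0)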
### ABSExcelLoader.py
### written in python 2.7 and pandas 0.14.1

import pandas as pd
assert pd.__version__ >= '0.14.1'

class ABSExcelLoader:

    def load(self, pathName, freq='M', index_name=None, verbose=False):
        """return a dictionary of pandas DataFrame objects for each
           Data work sheet in an ABS Excel spreadsheet"""
        wb = pd.ExcelFile(pathName)
        returnDict = {}
        for name in wb.sheet_names:
            if 'Data' not in name:
                continue
            # ExcelFile.parse: API behaviour change with pandas 0.14.1
            #df = wb.parse(sheetname=name, skiprows=8, header=9, index_col=0, na_values=['', '-', ' '])
            df = wb.parse(sheetname=name, skiprows=9, header=0, index_col=0, na_values=['', '-', ' '])
            periods = pd.PeriodIndex(pd.Series(df.index), freq=freq)
            df.set_index(keys=periods, inplace=True)
            df.index.name = index_name
            returnDict[name] = df
            if verbose:
                print("\nFile: '{}', sheet: '{}'".format(pathName, name))
                print(df.iloc[:min(5, len(df)), :min(5, len(df.columns))])
        return returnDict
### RBAExcelLoader.py
### written in python 2.7 and pandas 0.14.1

import pandas as pd
assert pd.__version__ >= '0.14.1'

class RBAExcelLoader:

    def load(self, pathName, freq='M', index_name=None, verbose=False):
        """return a pandas DataFrame for an RBA Excel spreadsheet"""
        wb = pd.ExcelFile(pathName)
        sheetname = 'Data'
        # ExcelFile.parse: API behaviour change with pandas 0.14.1
        #df = wb.parse(sheetname, skiprows=9, header=10, index_col=0, na_values=['', '-', ' '])
        df = wb.parse(sheetname, skiprows=10, header=0, index_col=0, na_values=['', '-', ' '])
        periods = pd.PeriodIndex(pd.Series(df.index), freq=freq)
        df.set_index(keys=periods, inplace=True)
        df.index.name = index_name  # apply the optional index name, as in the ABS loader
        if verbose:
            print("\nFile: '{}', sheet: '{}'".format(pathName, sheetname))
            print('Columns: {}'.format(df.columns.tolist()))
            print('Top left hand corner ...')
            print('------------------------')
            print(df.iloc[:min(5, len(df)), :min(5, len(df.columns))])
        return df
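For completeness, this is roughly how I use the two loaders. The file names below are made up, and the sheet names are assumptions based on the usual layout of ABS and RBA workbooks.

# hypothetical usage of the two loaders above (made-up file names)
absLoader = ABSExcelLoader()
absFrames = absLoader.load('5206001.xls', freq='Q', verbose=True)
gdp = absFrames['Data1']  # ABS data sheets are typically named 'Data1', 'Data2', ...

rbaLoader = RBAExcelLoader()
rates = rbaLoader.load('f01hist.xls', freq='M')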
Tuesday, July 15
Seasonal adjustment and decomposition
Having moved from R to python/pandas, one of the things I really miss is the extensive statistics library available in R.
I particularly like the STL function: seasonal-trend decomposition of a time series using loess (locally weighted regression). It decomposes a time series into its seasonal, trend and irregular components. Unfortunately, I could find nothing like it in pandas.
In frustration, I have cobbled together a python/pandas function to decompose a univariate pandas Series into seasonal, trend and irregular components. The function does not use loess; rather, it uses a set of processes similar to those used by the Australian Bureau of Statistics.
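For clarity, the two models the function supports are the standard ones:

Original(t) = Trend(t) * Seasonal(t) * Irregular(t)   (multiplicative)
Original(t) = Trend(t) + Seasonal(t) + Irregular(t)   (additive)

The multiplicative form suits series whose seasonal swings grow with the level of the series; the additive form suits series whose seasonal swings are roughly constant.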
It also uses a Henderson Moving Average which I had coded previously (see here).
Anyway, I'd be interested to know whether this code works for you. Drop me a line if you end up using it.
Caution: still experimental code.
## Decompose.py
## produce a time series decomposition

import pandas as pd
import numpy as np
from Henderson import Henderson

# --- A selection of seasonal smoothing weights, from which you can select
# Note: these are end-weights; they are reversed for the start of a series
# Note: your own weights in this form should also work
s3x3 = (
    np.array([5, 11, 11]) / 27.0,
    np.array([3, 7, 10, 7]) / 27.0,
    np.array([1, 2, 3, 2, 1]) / 9.0,
)
s3x5 = (
    np.array([9, 17, 17, 17]) / 60.0,
    np.array([4, 11, 15, 15, 15]) / 60.0,
    np.array([4, 8, 13, 13, 13, 9]) / 60.0,
    np.array([1, 2, 3, 3, 3, 2, 1]) / 15.0,
)
s3x9 = (
    np.array([0.051, 0.112, 0.173, 0.197, 0.221, 0.246]),
    np.array([0.028, 0.092, 0.144, 0.160, 0.176, 0.192, 0.208]),
    np.array([0.032, 0.079, 0.123, 0.133, 0.143, 0.154, 0.163, 0.173]),
    np.array([0.034, 0.075, 0.113, 0.117, 0.123, 0.128, 0.132, 0.137, 0.141]),
    np.array([0.034, 0.073, 0.111, 0.113, 0.114, 0.116, 0.117, 0.118, 0.120, 0.084]),
    np.array([1, 2, 3, 3, 3, 3, 3, 3, 3, 2, 1]) / 27.0,
)

# --- public Decomposition function
def Decompose(s, periods=None, model='multiplicative',
              constantSeasonal=False, seasonalSmoother=s3x5):
    """The simple decomposition of a pandas Series s into its trend, seasonal
    and irregular components. The default is a multiplicative model:
    Original(t) = Trend(t) * Seasonal(t) * Irregular(t).
    Can specify an additive model:
    Original(t) = Trend(t) + Seasonal(t) + Irregular(t)

    Parameters:
    - s - the pandas Series, without any missing or NA values,
      and sorted in ascending order
    - periods - either a pandas Series indicating the period to which
      each value of s belongs (of the same length as s, with the same
      index as s), or an int for the number of periods into which to
      decompose the series
    - model - string - either 'multiplicative' or 'additive'
    - constantSeasonal - bool - whether the seasonal component is constant
      or (slowly) variable
    - seasonalSmoother - when not using a constantSeasonal, which of the
      seasonal smoothers to use (s3x3, s3x5 or s3x9) - default is s3x5
      (ie over 7 years for monthly or quarterly data)

    Returns a pandas DataFrame with columns for each step in the
    decomposition process (enables debugging). The key columns in the
    DataFrame are:
    - 'Original' - the original series
    - 'SeasAdj' - the seasonally adjusted series
    - 'Trend' - the trend of the seasonally adjusted series
    - 'Seasonal' - the seasonal component found through the decomposition process
    - 'Irregular' - the irregular component found through the decomposition process

    Notes:
    Based on ideas gleaned from the Australian Bureau of Statistics:
    ABS (2005), "An Introductory Course on Time Series Analysis --
    Electronic Delivery", Catalogue: 1346.0.55.001, online at:
    http://www.ausstats.abs.gov.au/ausstats/free.nsf/0/C2714BAD1DD803E6CA256F960072B0C0/$File/1346055001_jan%202005.pdf
    Does not adjust for moving holidays, public holidays, variable number
    of working days in month, etc. (ie it is quite a simple decomposition)"""

    ### --- Sanity checks and initialisation --- ###

    # --- sanity checks
    if periods is None:
        raise ValueError('The periods parameter must be an int or a Series of ints')
    if not isinstance(s, pd.Series):
        raise TypeError('The s parameter should be a pandas Series')
    if not (s.index.is_monotonic and s.index.is_unique):
        raise ValueError('The index for the s parameter should be unique and sorted')
    if any(s.isnull()) or not all(np.isfinite(s)):
        raise ValueError('The s parameter contains NA or infinite values')

    # --- initialise
    result = pd.DataFrame(s)
    result.columns = ['Original']

    # --- determine the period
    if isinstance(periods, pd.Series):
        if not (len(s) == len(periods) and all(s.index == periods.index)):
            raise ValueError('The s and periods parameters must have the same index')
        result['period'] = periods
        periods = len(periods.unique())
    else:
        periods = int(periods)
        result['period'] = pd.Series(range(len(result)), index=s.index) % periods
    if periods < 2:
        raise ValueError('The periods parameter should be >= 2')
    if len(s) < (periods * 2) + 1:
        raise ValueError('The s parameter is not long enough to decompose')

    # --- settle the length of the Henderson moving average
    h = max(periods, 7)  # ABS uses a 13-term HMA for monthly and a 7-term HMA for quarterly data
    if h % 2 == 0:
        h += 1  # we need an odd number

    ### --- On to the decomposition process --- ###

    # --- 1 - derive an initial estimate for the trend component
    result['1stTrendEst'] = pd.rolling_mean(s, window=periods+1,
        min_periods=periods+1, center=True)
    # Note: the rolling mean leaves NA values at the start/end of the trend estimate.

    # --- 2 - preliminary estimate of the seasonal component
    if model == 'multiplicative':
        result['1stSeasonalEst'] = result['Original'] / result['1stTrendEst']
    else:
        result['1stSeasonalEst'] = result['Original'] - result['1stTrendEst']

    # --- 3 - smooth the seasonal
    result = _smoothSeasonalComponent(result, periods=periods,
        constantSeasonal=constantSeasonal, seasonalSmoother=seasonalSmoother,
        columnToBeSmoothed='1stSeasonalEst', newColumn='2ndSeasonalEst')

    # --- 4 - extend the smoothed seasonal estimate to full scale
    if any(result['2ndSeasonalEst'].isnull()):
        result = _extendSeries(result, periods=periods,
            columnToBeExtended='2ndSeasonalEst', newColumn='3rdSeasonalEst')
    else:
        result['3rdSeasonalEst'] = result['2ndSeasonalEst']

    # --- 5 - preliminary estimate of the seasonally adjusted data
    if model == 'multiplicative':
        result['1stSeasAdjEst'] = result['Original'] / result['3rdSeasonalEst']
    else:
        result['1stSeasAdjEst'] = result['Original'] - result['3rdSeasonalEst']

    # --- 6 - a better estimate of the trend
    result['2ndTrendEst'] = Henderson(result['1stSeasAdjEst'], h)

    # --- 7 - final estimate of the seasonal component
    if model == 'multiplicative':
        result['4thSeasonalEst'] = result['Original'] / result['2ndTrendEst']
    else:
        result['4thSeasonalEst'] = result['Original'] - result['2ndTrendEst']
    result = _smoothSeasonalComponent(result, periods=periods,
        constantSeasonal=constantSeasonal, seasonalSmoother=seasonalSmoother,
        columnToBeSmoothed='4thSeasonalEst', newColumn='Seasonal')

    # --- 8 - final estimate of the seasonally adjusted series
    if model == 'multiplicative':
        result['SeasAdj'] = result['Original'] / result['Seasonal']
    else:
        result['SeasAdj'] = result['Original'] - result['Seasonal']

    # --- 9 - final trend estimate
    result['Trend'] = Henderson(result['SeasAdj'], h)

    # --- 10 - final irregular
    if model == 'multiplicative':
        result['Irregular'] = result['SeasAdj'] / result['Trend']
    else:
        result['Irregular'] = result['SeasAdj'] - result['Trend']

    # --- 11 - our job here is done
    return result

# --- apply seasonal smoother
def _smoothSeasonalComponent(result, periods, constantSeasonal, seasonalSmoother,
                             columnToBeSmoothed, newColumn):
    # get the key smoothing constants
    if not constantSeasonal:
        kS = len(seasonalSmoother)
        lenS = (len(seasonalSmoother) * 2) - 1
        centralS = seasonalSmoother[len(seasonalSmoother)-1]

    # establish an empty return column ...
    result[newColumn] = np.repeat(np.nan, len(result))

    # populate the return column ...
    for u in result['period'].unique():
        # get each of the seasonals
        thisSeason = result[result['period'] == u][columnToBeSmoothed]

        # smooth to a constant seasonal value
        if constantSeasonal:
            thisSeasonSmoothed = pd.Series(np.repeat(thisSeason.mean(skipna=True),
                len(thisSeason)), index=thisSeason.index)

        # smooth to a slowly changing seasonal value
        else:
            # drop NA values which result from step 1 in the decomp process
            thisSeason = thisSeason.dropna()

            # apply the seasonalSmoother
            thisSeasonSmoothed = pd.rolling_apply(thisSeason, window=lenS,
                func=lambda x: (x * centralS).sum(), min_periods=lenS, center=True)

            # for short series the above process yields no data ... find a simple mean
            if all(thisSeasonSmoothed.isnull()):
                # same treatment as the constant seasonal value above
                thisSeasonSmoothed = pd.Series(np.repeat(thisSeason.mean(skipna=True),
                    len(thisSeason)), index=thisSeason.index)

            # handle the end-point problem ...
            for i in range(kS-1):
                if np.isnan(thisSeasonSmoothed.iat[i]):
                    thisSeasonSmoothed.iat[i] = (thisSeason.iloc[0:(i+kS)] *
                        (seasonalSmoother[i][::-1])).sum()  # note: reversed order at start
            for i in range(len(thisSeason)-1, len(thisSeason)-kS, -1):
                if np.isnan(thisSeasonSmoothed.iat[i]):
                    thisSeasonSmoothed.iat[i] = (thisSeason.iloc[i-(kS-1):len(thisSeason)] *
                        seasonalSmoother[len(thisSeason)-1-i]).sum()

        # package up season by season ...
        result[newColumn] = result[newColumn].where(result[newColumn].notnull(),
            other=thisSeasonSmoothed)

    return result

# --- extend seasonal components to the full length of series
def _extendSeries(result, periods, columnToBeExtended, newColumn):
    result[newColumn] = result[columnToBeExtended].copy()

    def fillup(result, fill, startPoint, endPoint):
        i = startPoint
        while True:
            p = result.index[i]
            result[newColumn].iat[i] = fill[newColumn].at[result['period'].iat[i]]
            if p >= endPoint:
                break
            i += 1

    # back-cast
    if np.isnan(result.iat[0, result.columns.get_loc(newColumn)]):
        fill = pd.DataFrame(result[newColumn].dropna().iloc[0:periods])
        fill['period'] = result['period'][fill.index[0]:fill.index[len(fill)-1]]
        endPoint = fill.index[0] - 1
        fill.index = fill['period']
        fillup(result=result, fill=fill, startPoint=0, endPoint=endPoint)

    # forward-cast
    if np.isnan(result.iat[len(result)-1, result.columns.get_loc(newColumn)]):
        fill = result[newColumn].dropna()
        fill = pd.DataFrame(fill[len(fill)-periods:len(fill)])
        fill['period'] = result['period'][fill.index[0]:fill.index[len(fill)-1]]
        startPoint = result.index.get_loc(fill.index[len(fill)-1] + 1)
        fill.index = fill['period']
        endPoint = result.index[len(result)-1]
        fillup(result=result, fill=fill, startPoint=startPoint, endPoint=endPoint)

    return result
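If you want to try it, something like the following should work. This is an untested sketch: the data are random, and I assume both Decompose.py and Henderson.py are on your path.

# untested usage sketch for Decompose (random data, made-up dates)
import numpy as np
import pandas as pd
from Decompose import Decompose

index = pd.period_range('2010-01', periods=48, freq='M')
s = pd.Series(10.0 + np.random.randn(48), index=index)

frame = Decompose(s, periods=12, model='additive')
print(frame[['Original', 'Trend', 'Seasonal', 'Irregular']].head())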
Monday, June 23
Henderson Moving Average
I have posted my R code for a Henderson moving average here. This is the same code in python.
## Henderson.py
## calculate a Henderson moving average
import pandas as pd
import numpy as np
def hmaSymmetricWeights(n):
    """ derive an n-term array of symmetric 'Henderson Moving Average' weights
        formula from ABS (2003), 'A Guide to Interpreting Time Series', page 41.
        returns a numpy array of symmetric Henderson weights indexed from 0 to n-1"""
    # calculate the constant denominator and terms
    m = int((n-1)//2)  # the mid point - n must be odd
    m1 = (m+1)*(m+1)
    m2 = (m+2)*(m+2)
    d = float(8*(m+2)*(m2-1)*(4*m2-1)*(4*m2-9)*(4*m2-25))
    m3 = (m+3)*(m+3)
    # calculate the weights
    w = np.repeat(np.nan, n)  # indexed from 0 to n-1
    for j in range(m+1):
        j2 = j*j
        v = (315*(m1-j2)*(m2-j2)*(m3-j2)*(3*m2-11*j2-16))/d
        w[m+j] = v
        if j > 0:
            w[m-j] = v
    w.flags.writeable = False  # let's make it quasi-immutable
    return (w)

def hmaAsymmetricWeights(m, w):
    """calculate the asymmetric end-weights
       w --> an array of symmetric Henderson weights (from the function above)
       m --> the number of asymmetric weights sought, where m < len(w)
       returns a numpy array of asymmetric weights, indexed from 0 to m-1;
       formula from Mike Doherty (2001), 'The Surrogate Henderson Filters in X-11',
       Aust. N.Z. J. of Stat. 43(4), 2001, pp. 901-999; see formula (1) on page 903"""
    n = len(w)  # the number of weights
    # - some quick sanity checks
    if m >= n:
        raise ValueError('The m argument must be less than n')
    if m <= int((n-1)//2):
        raise ValueError('The m argument must be greater than (n-1)/2')
    # --- let's build up Doherty's formula (1) from the top of page 903
    # - the second chunk of the formula
    sumResidual = w[range(m, n)].sum() / float(m)
    # - the last chunk of the formula
    sumEnd = 0.0
    for i in range(m+1, n+1):
        sumEnd += (float(i)-((m+1.0)/2.0)) * w[i-1]  # w indexed from 0 to n-1
    # - the beta squared / sigma squared term - formula at the bottom of page 904
    ic = 1.0
    if n >= 13 and n < 15:
        ic = 3.5
    elif n >= 15:
        ic = 4.5
    b2s2 = (4.0/np.pi)/(ic*ic)
    # - the gnarly bit in the middle of the formula
    denominator = 1.0 + ((m*(m-1.0)*(m+1.0)/12.0) * b2s2)
    u = np.repeat(np.nan, m)  # return series - created empty
    for r in range(m):  # r ranges 0 to m-1; but the formula assumes 1 to m
        numerator = ((r+1.0) - (m+1.0)/2.0) * b2s2
        # - finally putting it all together
        u[r] = w[r] + sumResidual + (numerator / denominator) * sumEnd
    u.flags.writeable = False  # let's make it quasi-immutable
    return (u)

def Henderson(s, n):
    """ Calculate an n-term Henderson Moving Average for the Series s
        Note: we blithely assume s is ordered, contiguous and without missing data"""
    # - some simple sanity checks
    if not isinstance(s, pd.core.series.Series):
        raise TypeError('The s argument should be a pandas Series')
    if not isinstance(n, int):
        raise TypeError('The n argument must be an integer')
    if n < 5:
        raise ValueError('The n argument must be >= 5')
    if n % 2 == 0:
        raise ValueError('The n argument must be odd')
    if len(s) < n:
        raise ValueError('The s argument should be a Series longer than n')
    # - calculate the symmetric weights
    w = hmaSymmetricWeights(n)
    # - preliminaries
    r = pd.Series(np.repeat(np.nan, len(s)), index=s.index)  # the empty return vehicle
    m = int((n-1)//2)
    l = len(s)
    # - and now move over the length of the series ...
    for i in range(len(s)):
        if i < m:
            # --- head section of series
            u = hmaAsymmetricWeights(m+i+1, w)[::-1]  # reverse - asymmetric to the left
            r.iloc[i] = (s.iloc[0:(i+m+1)] * u).sum()
        elif i + m >= l:
            # --- tail section of series
            u = hmaAsymmetricWeights(m+l-i, w)
            r.iloc[i] = (s.iloc[(i-m):l] * u).sum()
        else:
            # --- middle section of series
            r.iloc[i] = (s.iloc[(i-m):(i+m+1)] * w).sum()
    return (r)
### - test code
#--------------
# Check against Table 1 in B Quenneville and B Lefrancois (2001),
# "Implicit Forecasts in Musgrave Asymmetric Averages",
# Proceedings of the Annual Meeting of the American Statistical Association,
# August 5-9, 2001.
#--------------
#w = hmaSymmetricWeights(9)
#print(w)
#print(w.sum()) # should be one
#u = hmaAsymmetricWeights(7, w)
#print(u)
#print(u.sum()) # should be one
#--------------
#print (Henderson(pd.Series(range(30))+pd.Series(np.random.randn(30)), 13))
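For reference, the symmetric weights computed by hmaSymmetricWeights() above are just the following formula written out in code (with m = (n-1)/2 and j running from -m to m; this is the ABS (2003) formula cited in the docstring):

$$ w_j = \frac{315\left[(m+1)^2 - j^2\right]\left[(m+2)^2 - j^2\right]\left[(m+3)^2 - j^2\right]\left[3(m+2)^2 - 11j^2 - 16\right]}{8(m+2)\left[(m+2)^2 - 1\right]\left[4(m+2)^2 - 1\right]\left[4(m+2)^2 - 9\right]\left[4(m+2)^2 - 25\right]} $$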
Sunday, January 5
Pandas DataFrame cheat sheet and the Python v R debate
This has taken a lot longer than I thought it would. But I now have a rough, early draft of a cheat sheet on the Python pandas DataFrame object. When dealing with such a rich environment, deciding what to include and exclude in a four-page set of notes is challenging. As always, comments and suggestions for improvement are welcome. Doubtless there are many typographic errors that another set of eyes and a ruthless proofreading of the draft text would eradicate.
I have also been thinking about the Python versus R debate. Which is the better tool for data analysis? Clearly, this is one of those questions to which there is no ultimate truth. Your answer will depend on what sort of analysis you do. My analytical interest is primarily data capture and visualization. While I do some simple statistical analyses, I do not use R in a deep statistical way. I do, however, depend heavily on Hadley Wickham's excellent graphics package ggplot2.
Before coming to the question, I should disclose my relative experience with the two languages. I have been using R as my primary analytical language for a number of years. Over the same period I have also used Python, albeit less frequently, and largely for data munging (loading data from various sources into a MySQL server from which R can access it).
On the criterion of data munging, Python beats R hands down. Python is simply better able to manage dirty data in all its forms and is better for automating the cleaning and loading operations.
Another area where I think Python beats R is coding time. My impression is that I code at least twice as fast in Python, and I spend much less time debugging. I became a very defensive coder in R: every function I wrote religiously tested its arguments to see that the values were of the right type and within the expected ranges. It was not unusual to start a function with 6 to 12 stopifnot() statements. Even so, I simply code faster in Python. There are a few reasons for this. First, while both languages are very expressive (compared with C, C++ or Java), I find Python the more expressive language: list comprehensions and generator expressions are powerful tools for tight code, and while environments in R come close, they are nowhere near as natural to use as dictionaries in Python. Second, Python's much stronger typing better protects me from my poor typing skills (no pun intended). Third, my learning curve with Python was much shorter than for R. But again, this may just be a product of my background (as someone coming from the C, C++, Objective-C and Java programming paradigms).
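To illustrate the kind of tight code I have in mind (a trivial, made-up example):

# a generator expression feeding a dictionary: clean and count in one pass
words = ['spam', ' Spam ', 'ham', 'eggs', 'HAM']
counts = {}
for word in (w.strip().lower() for w in words):
    counts[word] = counts.get(word, 0) + 1

# and a list comprehension for filtering and transforming in one line
squares_of_evens = [n * n for n in range(10) if n % 2 == 0]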
On graphics, I think Hadley Wickham's ggplot2 beats the competition from Python in the form of matplotlib. But work is afoot to replicate ggplot2 in the Python environment. When that work is well progressed I might just change ships.
Another area where R leads is idiomatic coherence. While the pandas DataFrame object is an immensely rich environment, it feels cobbled together and a little rough at the edges (it does not feel coherently designed from the ground up). Take the myriad of indexing options: [], .loc[], .iloc[], .ix[], .iat[], .at[] and .xs(), which feel like a maze of twisty little passages, all alike, but each a little different. And then there are the confusing rules (for example, a single index on a DataFrame returns a column, but a single sliced index returns rows). Furthermore, the Pythonic notion of container truthiness was not maintained by pandas (in the rest of Python, empty containers are False and non-empty containers are True, regardless of what they contain). I could go on. But, simply put, data.frames in R are more coherent with the rest of the R language than the DataFrame object is with the rest of Python.
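A small sketch of the asymmetry I mean (a throwaway example, not from any of my production code):

import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df['a']     # a single label selects a column
df[0:2]     # but a slice selects rows
df.loc[0]   # a row, by label
df.iloc[0]  # a row, by position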
And another point of comparison in R's favour: the R help system is much more helpful than its counterpart from Python and pandas.
Finally, there is something of the nanny state in both Hadley Wickham's ggplot2 and Wes McKinney's DataFrame that annoys the hell out of me. Hadley won't let you plot a chart with two y-axes. Wes won't let you plot a time series as a bar chart. I can see the arguments for both protections. But really, is prohibition needed? Ironically, you can commit Hadley's unforgivable sin with Wes' DataFrame. And Hadley will happily let you plot a time series as a bar chart. It is time both Hadley and Wes embraced liberalism.
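For the record, here is how you commit Hadley's sin with pandas and matplotlib. A sketch only: secondary_y is a pandas plotting option in the versions I have used, but check it against your own version.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'small': np.arange(10.0),
                   'big': np.arange(10.0) * 1000.0})
ax = df['small'].plot()
df['big'].plot(ax=ax, secondary_y=True)  # the second y-axis
plt.show()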
Wednesday, January 1
Reading ABS Excel files in pandas
Experimental code follows for reading Excel files from the Australian Bureau of Statistics.
### ABSExcelLoader.py
### written in python 2.7 and pandas 0.12

import pandas as pd

class ABSExcelLoader:

    def load(self, pathName, freq='M', index_name=None, verbose=False):
        """return a dictionary of pandas DataFrame objects for each
           Data work sheet in an ABS Excel spreadsheet"""
        wb = pd.ExcelFile(pathName)
        returnDict = {}
        for name in wb.sheet_names:
            if 'Data' not in name:
                continue
            df = wb.parse(name, skiprows=8, header=9, index_col=0)
            periods = pd.PeriodIndex(pd.Series(df.index), freq=freq)
            df.set_index(keys=periods, inplace=True)
            df.index.name = index_name
            returnDict[name] = df
            if verbose:
                print("\nFile: '{}', sheet: '{}'".format(pathName, name))
                print(df.iloc[:min(5, len(df)), :min(5, len(df.columns))])
        return returnDict