When the ABS releases new data, my usual practice is to download the spreadsheet to my local hard drive and then run the appropriate R graphing program for that dataset. All of my graphing programs begin with a call to a function that gets the ABS data into an R data frame.
The first half of the function below is sanity checks (programming by contract). I use these a lot in R; they save time down the track by helping me avoid the many obscure bugs R can otherwise throw up. The second half restructures the data within the data frame, essentially making it useful. A short usage example follows the function listing.
readABS <- function(fileName=NULL, sheet=2, output=NULL) {
    # load a standard-format ABS Excel spreadsheet into an R data.frame

    ### --- upfront sanity checks (programming by contract)
    if(is.null(fileName)) {
        warning('readABS(fileName): fileName not specified')
        return(NULL)
    }
    if(length(fileName) > 1) {
        warning('readABS(fileName): not vectorised, only the first fileName will be opened')
        fileName <- fileName[1] # truncate to scalar
    }
    if(length(sheet) > 1) {
        warning('readABS(fileName): not vectorised, only the first sheet will be opened')
        sheet <- sheet[1] # truncate to scalar
    }
    if(!is.character(fileName) || is.na(fileName)) {
        warning('readABS(fileName): fileName must be a character string')
        return(NULL)
    }
    if(!file.exists(fileName)) {
        warning(paste('readABS(fileName): fileName does not exist --> ', fileName))
        return(NULL)
    }
    if(!('gdata' %in% rownames(installed.packages())))
        stop('readABS(fileName): the gdata package is not installed')
    if(!('zoo' %in% rownames(installed.packages())))
        stop('readABS(fileName): the zoo package is not installed')

    ### --- let's do the heavy lifting
    require('gdata')
    df <- read.xls(fileName, sheet, stringsAsFactors=FALSE)

    # minor sanity check
    if(nrow(df) < 10) {
        warning('readABS(fileName): something is odd; ABS spreadsheets normally have 10+ rows')
        return(df)
    }

    # use the ABS serial number as the column/variable name
    colnames(df) <- as.character(df[9, ])
    colnames(df)[1] <- 't' # special case for the first column

    # keep an optional record of the meta-data - may be useful for debugging
    if(!is.null(output) && !any(is.na(output))) {
        if(ncol(df) >= 2) {
            meta.data <- data.frame(stringsAsFactors=FALSE)
            for(i in 2:ncol(df)) {
                meta.data <- rbind(meta.data,
                    data.frame(index=as.character(df[[9, i]]),
                        metaData=paste(df[1:9, i], collapse='|'),
                        stringsAsFactors=FALSE))
            }
            write.csv(meta.data, file=as.character(output), row.names=FALSE)
        }
    }

    # delete the meta-data rows - i.e. exclude rows 1 through 9
    df <- df[-1:-9, ]

    # make the columns numeric - convert characters back to numbers
    if(ncol(df) >= 2)
        for(i in 2:ncol(df))
            df[[i]] <- as.numeric(as.character(df[[i]]))

    # fix dates - currently character strings in the format: Jun-1981
    # set the date to the middle of the month (or quarter, below)
    df$t <- as.Date(paste('15-', df$t, sep=''), format='%d-%b-%Y')

    if(nrow(df) > 2) {
        d <- as.numeric(df[[2, 't']] - df[[1, 't']])
        if(d > 85 && d < 95) {
            # time series appears to be quarterly ...
            require('zoo')
            m <- format(as.yearmon(df[[1, 't']]), '%b')
            if(m %in% c('Mar', 'Jun', 'Sep', 'Dec')) {
                # quarter-ending months: shift to the middle of the quarter
                t <- as.Date(as.yearmon(df$t), frac=1.0)
                df$t <- as.Date(as.yearqtr(t), frac=0.5)
            }
            else
                df$t <- df$t - 30 # middle of the middle month in the quarter
        }
    }

    # fix row numbers
    if(nrow(df) >= 1)
        rownames(df) <- 1:nrow(df)

    return(df)
}
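A typical call looks something like the sketch below. The file name, the metadata output name and the plotted column are placeholders for illustration; substitute whichever spreadsheet and series you have downloaded.

# hypothetical example - the file name and output name are placeholders
df <- readABS('6202001.xls', sheet=2, output='6202001-metadata.csv')
if(!is.null(df)) {
    str(df) # expect a 't' Date column plus one numeric column per ABS series
    plot(df$t, df[[2]], type='l', xlab='', ylab='',
        main=colnames(df)[2])
}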
Post Script
I should explain that this is not the method I actually use to read ABS files. I have a series of Python routines that load the ABS spreadsheets into a MySQL database, and I then access that database from R.
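The R side of that arrangement looks something like the following sketch. This is illustrative only: the database name, table layout and series identifier are invented for the example, not my actual schema.

library(DBI)
library(RMySQL)

# illustrative only - the database, table and column names here are invented
con <- dbConnect(RMySQL::MySQL(), dbname='abs', host='localhost',
    user='reader', password='...')
series <- dbGetQuery(con,
    "SELECT obs_date, value FROM time_series WHERE series_id = 'A1234567X'")
series$obs_date <- as.Date(series$obs_date) # ensure the date column is of class Date
dbDisconnect(con)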
The readABS() function above was a test piece of code to see whether I could read the ABS spreadsheets directly from R. It turns out that this code captures the numbers as they appear (visually) in the spreadsheet, rounded, rather than capturing the double-precision values that are available within the ABS spreadsheets. In my view, this is a significant limitation.