When the ABS releases new data, my usual practice is to download the spreadsheet to my local hard drive and then run the appropriate R graphing program for that dataset. All of my graphing programs begin with a call to a function that gets the ABS data into an R data frame.
The first half of the function below is sanity checks (programming by contract). I use these a lot in R; they save time down the track by helping me avoid the many obscure bugs R can otherwise throw up. The second half restructures the data within the data frame, essentially making it useful. A short usage example follows the function listing.
readABS <- function(fileName=NULL, sheet=2, output=NULL) {
    # load a standard-format ABS Excel spreadsheet into an R data.frame

    ### --- upfront sanity checks (programming by contract)
    if(is.null(fileName)) {
        warning('readABS(fileName): fileName not specified')
        return(NULL)
    }
    if(length(fileName) > 1) {
        warning('readABS(fileName): not vectorised, only the first fileName will be opened')
        fileName <- fileName[1] # truncate to scalar
    }
    if(length(sheet) > 1) {
        warning('readABS(fileName): not vectorised, only the first sheet will be opened')
        sheet <- sheet[1] # truncate to scalar
    }
    if(!is.character(fileName) || is.na(fileName)) {
        warning('readABS(fileName): fileName must be a character string')
        return(NULL)
    }
    if(!file.exists(fileName)) {
        warning(paste('readABS(fileName): fileName does not exist --> ', fileName))
        return(NULL)
    }
    if(!('gdata' %in% rownames(installed.packages())))
        stop('readABS(fileName): the gdata package is not installed')
    if(!('zoo' %in% rownames(installed.packages())))
        stop('readABS(fileName): the zoo package is not installed')

    ### --- let's do the heavy lifting
    require('gdata')
    df <- read.xls(fileName, sheet, stringsAsFactors=FALSE)

    # minor sanity check
    if(nrow(df) < 10) {
        warning('readABS(fileName): something is odd; ABS spreadsheets normally have 10+ rows')
        return(df)
    }

    # use the ABS serial number as the column/variable name
    colnames(df) <- as.character(df[9, ])
    colnames(df)[1] <- 't' # special case for the first column

    # keep an optional record of the meta-data - may be useful for debugging
    if(!is.null(output) && !any(is.na(output))) {
        if(ncol(df) >= 2) {
            meta.data <- data.frame(stringsAsFactors=FALSE)
            for(i in 2:ncol(df)) {
                meta.data <- rbind(meta.data,
                    data.frame(index=as.character(df[[9, i]]),
                        metaData=paste(df[1:9, i], collapse='|'),
                        stringsAsFactors=FALSE))
            }
            write.csv(meta.data, file=as.character(output), row.names=FALSE)
        }
    }

    # delete the meta-data rows - i.e. exclude rows 1 through 9
    df <- df[-1:-9, ]

    # make the columns numeric - convert characters back to numbers
    if(ncol(df) >= 2)
        for(i in 2:ncol(df))
            df[[i]] <- as.numeric(as.character(df[[i]]))

    # fix dates - currently character strings in the format: Jun-1981
    # set the date to the middle of the month (or quarter, below)
    df$t <- as.Date(paste('15-', df$t, sep=''), format='%d-%b-%Y')

    if(nrow(df) > 2) {
        d <- as.numeric(df[[2, 't']] - df[[1, 't']])
        if(d > 85 && d < 95) {
            # time series appears to be quarterly ...
            require('zoo')
            m <- format(as.yearmon(df[[1, 't']]), '%b')
            if(m %in% c('Mar', 'Jun', 'Sep', 'Dec')) {
                # quarter-ending months: shift to the middle of the quarter
                t <- as.Date(as.yearmon(df$t), frac=1.0)
                df$t <- as.Date(as.yearqtr(t), frac=0.5)
            }
            else
                df$t <- df$t - 30 # middle of the middle month in the quarter
        }
    }

    # fix row numbers
    if(nrow(df) >= 1)
        rownames(df) <- 1:nrow(df)

    return(df)
}
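A typical call looks something like the sketch below. The file name, the metadata output name and the plotted column are placeholders for illustration; substitute whichever spreadsheet and series you have downloaded.

# hypothetical example - the file name and output name are placeholders
df <- readABS('6202001.xls', sheet=2, output='6202001-metadata.csv')
if(!is.null(df)) {
    str(df) # expect a 't' Date column plus one numeric column per ABS series
    plot(df$t, df[[2]], type='l', xlab='', ylab='',
        main=colnames(df)[2])
}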
Post Script
I should explain that this is not the method I actually use to read ABS files. I have a series of Python routines that load the ABS spreadsheets into a MySQL database, and I then access that database from R.
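The R side of that arrangement looks something like the following sketch. This is illustrative only: the database name, table layout and series identifier are invented for the example, not my actual schema.

library(DBI)
library(RMySQL)

# illustrative only - the database, table and column names here are invented
con <- dbConnect(RMySQL::MySQL(), dbname='abs', host='localhost',
    user='reader', password='...')
series <- dbGetQuery(con,
    "SELECT obs_date, value FROM time_series WHERE series_id = 'A1234567X'")
series$obs_date <- as.Date(series$obs_date) # ensure the date column is of class Date
dbDisconnect(con)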
The readABS() function above was a test piece of code to see whether I could read the ABS spreadsheets directly from R. It turns out that this code captures the numbers as they appear (visually) in the spreadsheet, rounded, rather than capturing the double-precision values that are available within the ABS spreadsheets. In my view, this is a significant limitation.