# Mark the Graph

I like to plot!


## Tuesday, May 12

### Plotting time-series DataFrames in pandas

Pandas provides a convenience method for plotting DataFrames: DataFrame.plot. There is also a quick visualisation guide in the pandas documentation.

Unfortunately, when it comes to time series data, I don't always find the convenience method convenient. I often have a sparse DataFrame with lots of NaNs, which are not ignored by the convenience method. Additionally, I don't like the way that matplotlib places the lines hard against the left and right-hand sides of the canvas. I like a little bit of space at each end of the chart. Finally, I like playing with the tick marks and tick labels to get the right density of information on the x-axis.

Rather than use the inconvenient convenience method, I regularly find myself writing a short function to produce the plot layout I find a little more aesthetically pleasing. An example chart (from Mark the Ballot) and the associated Python code follow.


```python
import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter, YearLocator, MonthLocator

plt.style.use('ggplot')

def plot(df, filename, heading=None):
    fig, ax = plt.subplots(figsize=(8, 4))
    min_date = None
    max_date = None
    for col_name in df.columns.values:
        # plot the column
        col = df[col_name]
        col = col[col.notnull()]  # drop NAs
        dates = [zzz.to_timestamp().date() for zzz in col.index]
        ax.plot_date(x=dates, y=col, fmt='-', label=col_name,
                     tz=None, xdate=True, ydate=False, linewidth=1.5)

        # establish the date range for the data
        if min_date:
            min_date = min(min_date, min(dates))
        else:
            min_date = min(dates)
        if max_date:
            max_date = max(max_date, max(dates))
        else:
            max_date = max(dates)

    # give a bit of space at each end of the plot - aesthetics
    span = max_date - min_date
    extra = int(span.days * 0.03) * datetime.timedelta(days=1)
    ax.set_xlim([min_date - extra, max_date + extra])

    # format the x tick marks
    ax.xaxis.set_major_formatter(DateFormatter('%Y'))
    ax.xaxis.set_minor_formatter(DateFormatter('\n%b'))
    ax.xaxis.set_major_locator(YearLocator())
    ax.xaxis.set_minor_locator(MonthLocator(bymonthday=1, interval=2))

    # grid, legend and yLabel
    ax.grid(True)
    ax.legend(loc='best', prop={'size': 'x-small'})
    ax.set_ylabel('Percent')

    # heading
    if heading:
        fig.suptitle(heading, fontsize=12)
    fig.tight_layout(pad=1.5)

    # footnote
    fig.text(0.99, 0.01, 'marktheballot.blogspot.com.au',
             ha='right', va='bottom', fontsize=8, color='#999999')

    # save to file
    fig.savefig(filename, dpi=125)
```
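A hypothetical usage sketch (the names and data here are invented for illustration): the function expects a DataFrame with a PeriodIndex, since it calls .to_timestamp() on the index values, and it drops the NaNs per column before plotting.

```python
import numpy as np
import pandas as pd

# a sparse DataFrame with a monthly PeriodIndex, which is what the
# plot() function above expects
idx = pd.period_range('2012-01', periods=36, freq='M')
df = pd.DataFrame({'Series A': np.linspace(40.0, 60.0, 36),
                   'Series B': np.linspace(60.0, 40.0, 36)}, index=idx)
df.loc[idx[5:10], 'Series B'] = np.nan   # deliberate gaps

# the per-column NaN handling used inside plot():
col = df['Series B']
col = col[col.notnull()]                 # drop the gaps
dates = [p.to_timestamp().date() for p in col.index]

# plot(df, 'example.png', heading='Example Chart')  # writes example.png
```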

## Sunday, May 3

### Using python statsmodels for OLS linear regression

This is a short post about using the python statsmodels package for calculating and charting a linear regression.

Let's **start with some dummy data**, which we will enter using iPython. We fake up normally distributed data around y ~ x + 10.

```python
In [1]: import numpy as np
In [2]: x = np.random.randn(100)
In [3]: y = x + np.random.randn(100) + 10
```

We can plot this simply ...

```python
In [4]: import matplotlib.pyplot as plt
In [5]: fig, ax = plt.subplots(figsize=(8, 4))
In [6]: ax.scatter(x, y, alpha=0.5, color='orchid')
In [7]: fig.suptitle('Example Scatter Plot')
In [8]: fig.tight_layout(pad=2);
In [9]: ax.grid(True)
In [10]: fig.savefig('filename1.png', dpi=125)
```

That was easy. Next we will add a **regression line**. We will use the statsmodels package to calculate the regression line. Lines 11 to 15 are where we model the regression; in lines 16 to 20 we calculate and plot the regression line.

The key trick is at line 12: we need to add the intercept term explicitly. Without this step, the regression model would be y ~ x, rather than y ~ x + c. Similarly, at line 17, we include an intercept term in the data we provide to the predict method at line 18. The sm.add_constant() method prepends a column of ones for the constant term in the regression model, returning a two-column numpy array. The first column is ones; the second column is our original data from above.

```python
In [11]: import statsmodels.api as sm
In [12]: x = sm.add_constant(x)  # constant intercept term
In [13]: # Model: y ~ x + c
In [14]: model = sm.OLS(y, x)
In [15]: fitted = model.fit()
In [16]: x_pred = np.linspace(x.min(), x.max(), 50)
In [17]: x_pred2 = sm.add_constant(x_pred)
In [18]: y_pred = fitted.predict(x_pred2)
In [19]: ax.plot(x_pred, y_pred, '-', color='darkorchid', linewidth=2)
In [20]: fig.savefig('filename2.png', dpi=125)
```

If we wanted key data from the regression, the following would do the job, after line 15:

```python
print(fitted.params)     # the estimated parameters for the regression line
print(fitted.summary())  # summary statistics for the regression
```

We can add a **confidence interval for the regression**. There is a 95 per cent probability that the true regression line for the population lies within the confidence interval for our estimate of the regression line calculated from the sample data. We will calculate this from scratch, largely because I am not aware of a simple way of doing it within the statsmodels package.

To get the necessary t-statistic, I have imported the scipy stats package at line 27, and calculated the t-statistic at line 28.

```python
In [22]: y_hat = fitted.predict(x)  # x is an array from line 12 above
In [23]: y_err = y - y_hat
In [24]: mean_x = x.T[1].mean()
In [25]: n = len(x)
In [26]: dof = n - fitted.df_model - 1
In [27]: from scipy import stats
In [28]: t = stats.t.ppf(1-0.025, df=dof)
In [29]: s_err = np.sum(np.power(y_err, 2))
In [30]: conf = t * np.sqrt((s_err/(n-2))*(1.0/n + (np.power((x_pred-mean_x),2) /
   ....:        ((np.sum(np.power(x_pred,2))) - n*(np.power(mean_x,2))))))
In [31]: upper = y_pred + abs(conf)
In [32]: lower = y_pred - abs(conf)
In [33]: ax.fill_between(x_pred, lower, upper, color='#888888', alpha=0.4)
In [34]: fig.savefig('filename3.png', dpi=125)
```

The final step is a **prediction interval**. There is a 95 per cent probability that the real value of y in the population for a given value of x lies within the prediction interval. There is a statsmodels method in the sandbox we can use.

```python
In [35]: from statsmodels.sandbox.regression.predstd import wls_prediction_std
In [36]: sdev, lower, upper = wls_prediction_std(fitted, exog=x_pred2, alpha=0.05)
In [37]: ax.fill_between(x_pred, lower, upper, color='#888888', alpha=0.1)
In [38]: fig.savefig('filename4.png', dpi=125)
```

Labels:
Python,
statistics,
wonkish


## Saturday, February 21

### Load ABS files to MySQL

When I am working in R, I tend to have my working data in a MySQL database. I found that R did not always play nicely (and quickly) with complex Microsoft Excel files.

Previously, I had quite a complex bit of python code to read Excel files and upload them to MySQL. I have now retooled the way in which I load files from the Australian Bureau of Statistics (ABS) to MySQL using Python pandas. The code is much simpler.

First, I store my MySQL database username and password in a file named *MysqlConnect.py* (it is used by a number of different programs). It lives in the bin directory (../bin from where I do my work). And, just in case you are wondering: no, it is not my password.

```python
host = 'localhost'
user = 'root'
password = 'BigRedCar'
database = 'dbase1'
```

Now let's move on to the function to load ABS files into MySQL. It lives in the bin directory (../bin from where I do my work), in a file named *LoadABSToMySQL.py*.

```python
import pandas as pd
import pymysql
from sqlalchemy import create_engine
import os.path
import re

# local imports - a file that contains database login details
import MysqlConnect as MSC

def LoadABSToMySQL(pathName):
    """Read an Excel file from the Australian Bureau of Statistics
       and load it into a MySQL database"""

    # --- 1 --- open MySQL
    s = ('mysql+pymysql://' + MSC.user + ':' + MSC.password +
         '@' + MSC.host + '/' + MSC.database)
    engine = create_engine(s)

    # --- 2 --- identify proposed table name from file name
    (head, tail) = os.path.split(pathName)
    tail = re.split(r'\.', tail)
    tablename = tail[0]

    # --- 3 --- open the XL file
    wb = pd.ExcelFile(pathName)

    # --- 4 --- load XL worksheets into a single DataFrame
    df = pd.DataFrame()
    for name in wb.sheet_names:
        # -- ignore junk
        if not 'Data' in name:
            continue
        # -- read
        tmp = wb.parse(sheetname=name, header=9, index_col=0,
                       na_values=['', '-', ' '])
        # -- amalgamate
        df = pd.merge(left=df, right=tmp, how='outer',
                      left_index=True, right_index=True)
        tmp = None

    # --- 5 --- write this DataFrame to MySQL
    df.to_sql(tablename, engine, index=True, if_exists='replace')
```
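As a small aside, the table-name step (--- 2 ---) simply takes the file's base name without its extension. That logic in isolation, with an invented example path:

```python
import os.path
import re

pathName = './raw-data/5206001_key_aggregates.xls'
(head, tail) = os.path.split(pathName)   # directory part and file name
tablename = re.split(r'\.', tail)[0]     # strip the extension
print(tablename)   # 5206001_key_aggregates
```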

Finally, an example code snippet to load some of the ABS National Account files to MySQL. This file sits in my national accounts directory and has the rather unimaginative name *py-load.py*. The ABS Microsoft Excel files live in the ./raw-data sub-directory.

```python
import sys
sys.path.append('../bin')
from LoadABSToMySQL import LoadABSToMySQL

dataDirectory = './raw-data/'
dataFiles = [
    '5206001_key_aggregates',
    '5206002_expenditure_volume_measures',
    '5206003_expenditure_current_price',
    '5206004_expenditure_price_indexes',
    '5206006_industry_gva',
    '5206008_household_final_consumption_expenditure',
    '5206022_taxes',
    '5206023_social_assistance_benefits',
    '5206024_selected_analytical_series'
]
dataSuffix = '.xls'

for f in dataFiles:
    LoadABSToMySQL(dataDirectory + f + dataSuffix)
```

To run this python load file, I have a BASH shell script, which I use on my iMac. It has another unimaginative name: *run-load.sh*.

```bash
#!/bin/bash
# mac os x fix ...
cd "$(dirname "$0")"
python ./py-load.py
```


## Wednesday, February 11

### An Extra Dry Baltic Index

The Baltic Dry Index provides a window on world trade. The view out the window ain't that good at the moment.

The Baltic Dry Index was at 663 on 5 December 2008 (its post-GFC low point).

It was at 554 on 9 February 2015. But no need to worry, it rebounded back up to 556 on 10 February.

Labels:
Baltic Dry
