Currently in Australia there has been some news reports of Federal MPs abusing their expenses….
So I thought I would take a look at what publicly available data there is and then use the Microsoft Excel Power* (PowerQuery, PowerPivot, PowerView) tools to do some analysis.
This post is about using PowerQuery to pull data from the Department of Finance website into Excel and get it into shape to be used with PowerPivot.
I will be using the Department of Finance website, here, if you browse to this page you will see that there is a series of HTML tables that contain the expenditure broken down by type for each minister.
Using PowerQuery it is very easy to download this data, transform and shape it nicely and then have it
automagically automatically appear in Excel.
So here is how I did it using Excel 2013.
First I selected the PowerQuery tab in the Excel ribbon, then clicked on the “From Web” button. This will bring up a dialog box asking for a URL. I started with the link above and pasted that in.
That loads up the PowerQuery Query Editor, this is where all of the transformation happens.
Things to note about this screen:
- Down the left-hand side you have the Navigator – this shows all the HTML tables that PowerQuery “discovered”
- In the middle of the screen you have the data that will be loaded into Excel when you press the “Done” button (currently blank)
- On the right-hand side you have the steps that have been performed (this is expanded using the little arrow, there is also a handy little count of how many steps have been performed)
There were a number of steps that I went through to get the data ready for analysis.
The first was setting up the source, clicking “New South Wales” in the Navigator section will then display in the main grid all the data for ministers from “New South Wales”, but I want to combine that with ministers from ALL states and territories.
To combine multiple tables together:
- Click on New South Wales
- Expand the Steps
- Click on Source – notice that the formula bar changes to say
- This can be replaced with
After performing this step, I click the Refresh button, do some scrolling and I can see that ministers from Melbourne, Brisbane and cities in states other than New South Wales are now in the grid
Right, so now that the source has been setup it’s time for the second stage – the transformation stage.
Looking through the data you can see that the columns are all strings, I want the ones containing monetary values to be numeric so that later on I can SUM them, average them etc.
There are a number of steps to achieve this:
- Getting rid of all the $ symbols, I achieved this by selecting all the columns then right-clicking on one of the headers and select the “Replace Values” option. In here I’m specifying $ as the search value and leaving the replacement value empty.
- Now to convert them into a number – once again, select all the columns, right click on a heading then select Change Type and then Number
The next part of the transformation stage is to do some aesthetic cleansing, this is renaming the column headings. I just went through and renamed each one, removing the number and the extra whitespace.
As this set of data is going to be combined with data from other periods, I added a column to identify what period this data is from. Adding a new column is done by right-clicking the header row then Insert Column and Custom. I named the column Period and gave it the value “1 January to 30 June 2012”.
That completes the transformation stage and all the work that needs to be done in the Query Editor.
Clicking Done brings all the data into a new Excel sheet.
Now you might be thinking, well I could have just copied and pasted the data from the webpage straight into Excel and cut out this messing about, and while that is true for a lot of data from the web it misses out one of the key features of PowerQuery.
If you look closely at the screenshot of Excel above, you will notice on the right-hand side the Query Settings, in it there is a “Last updated” time and a “Refresh” button. When the refresh button is clicked, PowerQuery will download the HTML document again, and process it using the steps defined above, removing the need for any manual cleanup.
It’s hassle free 🙂 and that is one of the reasons I like it.
In the next post I’m going to show how to pull in the data from the other periods so that in a future post I can do historical comparisons on expenditure.
(You can download the fully working spreadsheet from my dropbox)