Scraping the Web with Power Query

Last year I entered the PowerBI video demo contest. Whilst I didn’t win any prizes, I did learn a fair bit from going through the exercise of putting together a screencast demo (more on that another time). In this post I’m going to walk through the web-scraping part of my demo.

The website that I chose for my demo was the National UFO Reporting Center (more for novelty’s sake than any serious interest). Below you can see the listing by shape of craft:


If I click on one of these links, the details of each individual sighting are listed:


The overall goal is to have the individual sightings retrieved from each page and stored in a Power Pivot table.

One thing which makes scraping this site easy is that the index page links to each of the individual pages in a consistent manner: {SHAPE NAME HERE}.html

This means the process of scraping will follow these steps:

  1. Retrieve the list of shapes from the index page and store these
  2. Loop through each of the shapes from step 1:
    1. Dynamically generate the URI based on the shape name
    2. Retrieve the table of sightings
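In M terms, the steps above boil down to one pattern: build the URL from the shape name, then hand it to Web.Page over Web.Contents. A minimal sketch (the base URL here is a placeholder, not the real site address):

    let
        shapeName = "Cone",
        // {shape} stands in for the shape name; the base URL is hypothetical
        url = Text.Replace("http://example.com/{shape}.html", "{shape}", shapeName),
        // Web.Contents fetches the page, Web.Page parses its HTML tables
        Source = Web.Page(Web.Contents(url)),
        // the first table found on the page holds the sightings
        Sightings = Source{0}[Data]
    in
        Sightings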

Retrieving the Shape index table

From Excel select the Power Query tab, then click the From Web button.


This will bring up a dialog asking for the URL of the webpage; enter the URL for the shape index:

Doing this used to take you directly into the Power Query editor, but now there is an extra step. On the right-hand side of Excel you will see a Navigator appear:


This shows all the different elements that Power Query discovered on the web page. Table 0 contains the list of all the shapes, which you can see in the preview when you hover your mouse over it. Click Table 0, then click Edit to bring up the Query Editor.

At this point all I did was a few clean-up tasks:

  • Removed the Count column
  • Renamed “Reports” to “Shape”
  • Renamed the Query to “Shape List”

I also un-ticked “Load to Worksheet”; you will see why shortly.


Retrieving individual sightings

The next step is to parse the individual sighting pages, which I’ll do in two stages. First I’ll take one of the sighting pages and use it to work out the transformations and clean-up tasks. Then I’ll show you how to turn this into a function which takes a shape name and uses it to dynamically generate the URL for retrieval; this function will then be used in the next section.

For my example I will be using the Cone-shaped sightings.

Once again use the From Web button and enter the above URL, then in the navigator select Table 0 and then click Edit to open the Query Editor.

I performed the following clean up tasks:

  • Removed “Posted” and “Shape” columns
  • Used the Split Column transform on the “Date / Time” column (delimited by Space, at the left-most)
  • Renamed the new columns to “Date” and “Time”

The dates are in a format that Power Query is unable to recognize so I had to perform some extra transformation steps to reformat them:

  • Used the Split Column transform on the “Date” column (delimited by custom, /, at each occurrence). This will split out each date part.
  • Changed the data type of these new columns to Text
  • Inserted a custom column with the following formula:

Date.From(Text.Combine({[Date.3],[Date.1],[Date.2]}, "/"))

  • Removed the “Date.1”, “Date.2”, “Date.3” columns
  • Moved the Date column to be in the first position
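In the Advanced Editor, these clean-up steps appear as M along these lines (step names will differ depending on how you click through the UI; this is a sketch of the generated code, not a verbatim copy):

    SplitDate = Table.SplitColumn(Sightings, "Date", Splitter.SplitTextByDelimiter("/"), {"Date.1", "Date.2", "Date.3"}),
    TypedText = Table.TransformColumnTypes(SplitDate, {{"Date.1", type text}, {"Date.2", type text}, {"Date.3", type text}}),
    // rebuild the date with the year first so Power Query can parse it
    AddedDate = Table.AddColumn(TypedText, "Date", each Date.From(Text.Combine({[Date.3], [Date.1], [Date.2]}, "/"))),
    RemovedParts = Table.RemoveColumns(AddedDate, {"Date.1", "Date.2", "Date.3"}),
    Reordered = Table.ReorderColumns(RemovedParts, {"Date", "Time", "City", "State", "Duration", "Summary"})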


The next step is to turn this into a reusable formula. Bring up the Advanced Editor from the View tab in the ribbon.



To turn this into a function replace the first two rows with the following code:

let
    fnGetByShape = (shapeName as text) =>
    let
        Source = Web.Page(Web.Contents(Text.Replace("{shape}.html", "{shape}", shapeName))),


and insert the following at the end:

in
    fnGetByShape
These two code snippets turn the query into a function. Click Done to close the Advanced Editor and you will see that the table of data has been replaced with this:


You can test this out by clicking the Invoke button and entering in a shape name (e.g. Cone).

The last thing left to do is give the query a good name (fnGetSightingsByShape) and ensure that both the Load to Worksheet and Load to Data Model options are un-ticked.
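Putting the pieces together, the finished function looks roughly like this (the base URL is again a placeholder, and the clean-up steps from above are elided):

    let
        fnGetSightingsByShape = (shapeName as text) =>
        let
            // hypothetical base URL – substitute the real index address
            Source = Web.Page(Web.Contents(Text.Replace("http://example.com/{shape}.html", "{shape}", shapeName))),
            Data = Source{0}[Data]
            // ...the date/time clean-up steps go here...
        in
            Data
    in
        fnGetSightingsByShape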

Using the function

Now that both the list of shapes query and the function to retrieve each individual sighting are ready, it is time to combine them together and load the data into Power Pivot.

Open the Shape List query that was created earlier and add a custom column:


This will add a new column which calls the custom function for each shape. Expanding the Table in the new column will show the columns returned by the function (Date, Time, City, State, Duration, Summary). I had to set the data types for the Date and Time columns.
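Under the covers, the custom column and the expansion are two table operations. Assuming the shape query’s last step is named ShapeList, the generated M is roughly:

    WithSightings = Table.AddColumn(ShapeList, "Sightings", each fnGetSightingsByShape([Shape])),
    Expanded = Table.ExpandTableColumn(WithSightings, "Sightings", {"Date", "Time", "City", "State", "Duration", "Summary"})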

Now the only thing left to do is tick “Load to Data Model” then get a cuppa tea or coffee while Power Query loads the data into Power Pivot.

In Summary

In this post I’ve shown you how you can use Power Query to scrape data from a website. In addition, I’ve shown you how to build reusable functions which can be used to scrape different levels of pages on a website.

I’ve uploaded my finished demo to Dropbox for your enjoyment.

Scraping the Web with Excel 2013 – PowerBI Competition Entry

The Microsoft PowerBI Competition is now in full swing with the voting open to the public for the next week. (Check out my entry).

As you can see below I just made my submission in time.

I like to cut it fine!

When I came to building my demo (check it out) I had a few different data sets in mind, but there were two main points that I wanted to highlight in my entry:

  1. How Power Query makes scraping webpages and extracting data into Excel really easy
  2. How Power View can be used to build visually stunning and interactive reports.

I ended up going with the National UFO Reporting Center. They have very helpfully made their database available in plain old HTML tables, which makes it rather trivial to retrieve using Power Query.

The biggest takeaway that I had from doing this is that doing screencasts is difficult! I have a new found appreciation for the Pluralsight authors who do an amazing job of lining up the timing of doing something on screen and narrating it.

Find sysadmins Hiding in Active Directory with PowerQuery

Chances are, if you are using SQL Server in an enterprise environment, you are using Active Directory groups to set up and maintain permissions. But how, as a DBA, do you know who the users behind the groups are, and how many there actually are? This is especially important for the sysadmin permission.

In this article I’m going to show how to use PowerQuery to find all the users in Active Directory that have the sysadmin permission. As Brent Ozar puts it “these are the people that can get you fired”.

NB: I will be using the November update for PowerQuery.

Get a list of sysadmins

Open Excel 2013 and on the PowerQuery tab select “From Database” and then “From SQL Server Database”. In the window that is then displayed, enter your server name, master for the database, then the following SQL:

SELECT
  spm.name,
  sp.is_disabled,
  sp.name AS permission,
  spm.type_desc
FROM
  sys.server_role_members srm,
  sys.server_principals sp,
  sys.server_principals spm
WHERE
  sp.name = 'sysadmin' AND
  sp.principal_id = srm.role_principal_id AND
  spm.principal_id = srm.member_principal_id AND
  spm.type_desc = 'WINDOWS_GROUP';


Click OK to load a preview into the Query Editor.

Split the name column using \ as the custom separator, then rename the columns to Domain, GroupName, IsDisabled and Permission, and remove the type_desc column.
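Those clean-up steps generate M along these lines (step and column names are illustrative):

    SplitName = Table.SplitColumn(Source, "name", Splitter.SplitTextByDelimiter("\"), {"Domain", "GroupName"}),
    Renamed = Table.RenameColumns(SplitName, {{"is_disabled", "IsDisabled"}, {"permission", "Permission"}}),
    Removed = Table.RemoveColumns(Renamed, {"type_desc"})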

You should end up with something like this:


Click Apply & Close to load the results into Excel.

Creating a PowerQuery Function using Active Directory

The next step is to create a PowerQuery function which will query Active Directory for a given domain and group and return the members.

I found the easiest way to create this as a function was to go through all the steps for the first group, then use the Advanced Editor to turn the M code into a generic function which could be reused.

This was the function that I ended up creating:

    let
        GetADUsers = (domainName as text, groupName as text, ou as text, dc as text) =>
        let
            Source = ActiveDirectory.Domains(domainName),
            theDomain = Source{[Domain=domainName]}[#"Object Categories"],
            group = theDomain{[Category="group"]}[Objects],
            InsertedCustom1 = Table.AddColumn(group, "distinguishedNameLower", each Text.Lower([distinguishedName])),
            ABC = InsertedCustom1{[distinguishedNameLower=Text.Lower(Text.Replace(Text.Replace(Text.Replace("CN={GROUPNAME},OU=Groups,OU={OU},DC={DOMAIN},DC=priv", "{GROUPNAME}", groupName), "{DOMAIN}", dc), "{OU}", ou))]}[group],
            member = ABC[member],
            TableFromList = Table.FromList(member, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
            DEF = Table.ExpandRecordColumn(TableFromList, "Column1", {"displayName", "distinguishedName"}, {"Column1.displayName", "Column1.distinguishedName"}),
            InsertedCustom = Table.AddColumn(DEF, "Custom", each Text.Contains([Column1.distinguishedName],"OU=Users")),
            RenamedColumns = Table.RenameColumns(InsertedCustom,{{"Custom", "IsUser"}}),
            InsertedCustom4 = Table.AddColumn(RenamedColumns, "Custom.3", each Text.Range([Column1.distinguishedName], Text.PositionOf([Column1.distinguishedName],"DC=")+3,Text.Length([Column1.distinguishedName])-Text.PositionOf([Column1.distinguishedName],",DC=priv")+4)),
            RenamedColumns2 = Table.RenameColumns(InsertedCustom4,{{"Custom.3", "Domain"}}),
            RemovedColumns1 = Table.RemoveColumns(RenamedColumns2,{"Column1.distinguishedName"}),
            ReorderedColumns = Table.ReorderColumns(RemovedColumns1,{"Column1.displayName", "Domain", "IsUser"}),
            RenamedColumns1 = Table.RenameColumns(ReorderedColumns,{{"Column1.displayName", "DisplayName"}})
        in
            RenamedColumns1
    in
        GetADUsers

You can paste this directly into a blank query, and then test it by using the Invoke button in the Query Editor.


If you do test the function using Invoke, make sure you remove the invocation step afterwards, then click Apply & Close.

Combining the sysadmin Active Directory groups and the PowerQuery function

In this final step the list of Active Directory groups with the sysadmin permission is combined with the function created above.

Before this can be done a mapping table needs to be created for the domain name that SQL Server outputs and what it actually is. For example, as far as Active Directory is concerned your domain might be mydomain.local or mydomain.priv, however SQL Server is only going to say mydomain\groupname.

So, I created a very simple table in an Excel sheet with four columns (Domain, FullyQualifiedDomain, OU, DC). Which looks something like this:


NB: To merge this table with the output from the sysadmins query, the values in the Domain column need to match exactly.

With this table selected, I clicked the “From Table” option in the PowerQuery ribbon to bring up the Query Editor so that I could merge this table with the results of the sysadmin query.

To merge this table, click the Merge button, select the query which corresponds to the sysadmins in the drop-down, then click the “Domain” column heading in each table.


Click OK and a new column appears. I then expanded the column, with all the new columns of the table selected.
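In M, the merge and the expansion correspond to Table.NestedJoin followed by Table.ExpandTableColumn. Assuming the mapping query is the current source and the sysadmin query is named Sysadmins (names here are illustrative), a sketch:

    Merged = Table.NestedJoin(Source, {"Domain"}, Sysadmins, {"Domain"}, "NewColumn"),
    Expanded = Table.ExpandTableColumn(Merged, "NewColumn", {"GroupName", "IsDisabled", "Permission"})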

To use the GetADUsers function, add a custom column and call the function:
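Assuming the mapping columns are named as above (the exact column names in your query may differ after the merge), the custom column formula is simply a call to the function:

    = GetADUsers([FullyQualifiedDomain], [GroupName], [OU], [DC])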


Click OK, then expand the newly added column.

I then cleaned things up a bit by renaming some of the columns and removing the unnecessary ones. The final result ends up like this:


Click Apply & Close to load the result into Excel and send it off to your boss / security team 🙂


PowerQuery – Turning My Frown Upside Down

Microsoft has started to include this very handy button in all of their new Excel add-ins.


It lets people send feedback to the Microsoft developers about problems they are having, problems they have solved, or whether they are generally happy with the product.

This week I was using PowerQuery to see if I could use it to query Active Directory (AD) and list all the members of a specific AD group (more on this in a future post). However, whilst I was able to get PowerQuery to list the members of some AD groups, it wouldn’t list them for the groups I was actually interested in.


I reached out to a friend of mine, Niko Neugebauer (twitter | web). He took a look at what I had done and couldn’t see anything obviously wrong with it, so he asked me, “did you try sending a frown to the dev team?”. Until he suggested it, the idea hadn’t crossed my mind.

I promptly clicked “Send a Frown”. This brought up an email already typed out, with a screenshot of what I was doing and the formulas I had written; I added some more detail and sent it off.

I also posted on the PowerQuery TechNet forum.

The next day I received a response to my email from the PowerQuery team asking me some questions. At the same time, someone who works on the PowerQuery team reached out to me on Twitter, and there were two responses waiting for me on the forum, one of which was the solution.

My frown was turned upside down (to a smile), and I learnt a lesson: it’s OK to send a frown from time to time.

The answer was to turn on Fast Combine, but more on that another time. Thanks to all of you who helped me out 🙂

Analysing Australian Parliamentarians’ Expenditure with PowerQuery

Currently in Australia there have been some news reports of Federal MPs abusing their expenses.

So I thought I would take a look at what publicly available data there is and then use the Microsoft Excel Power* (PowerQuery, PowerPivot, PowerView) tools to do some analysis.

This post is about using PowerQuery to pull data from the Department of Finance website into Excel and get it into shape to be used with PowerPivot.

I will be using the Department of Finance website, here. If you browse to this page you will see that there is a series of HTML tables containing the expenditure broken down by type for each minister.

Using PowerQuery it is very easy to download this data, transform and shape it nicely, and then have it automagically appear in Excel.

So here is how I did it using Excel 2013.

First I selected the PowerQuery tab in the Excel ribbon, then clicked on the “From Web” button. This will bring up a dialog box asking for a URL. I started with the link above and pasted that in.


That loads up the PowerQuery Query Editor; this is where all of the transformation happens.


Things to note about this screen:

  • Down the left-hand side you have the Navigator – this shows all the HTML tables that PowerQuery “discovered”
  • In the middle of the screen you have the data that will be loaded into Excel when you press the “Done” button (currently blank)
  • On the right-hand side you have the steps that have been performed (this is expanded using the little arrow, there is also a handy little count of how many steps have been performed)

There were a number of steps that I went through to get the data ready for analysis.

The first was setting up the source. Clicking “New South Wales” in the Navigator section will display in the main grid all the data for ministers from New South Wales, but I want to combine that with ministers from ALL states and territories.

To combine multiple tables together:

  • Click on New South Wales
  • Expand the Steps
  • Click on Source – notice that the formula bar changes to say

= Source{0}[Data]

  • This can be replaced with

= Source{0}[Data]&Source{1}[Data]&Source{2}[Data]&Source{3}[Data]&Source{4}[Data]&Source{5}[Data]&Source{6}[Data]&Source{7}[Data]

After performing this step, I click the Refresh button, do some scrolling, and I can see that ministers from Melbourne, Brisbane and cities in states other than New South Wales are now in the grid.


Right, so now that the source has been set up it’s time for the second stage – the transformation stage.

Looking through the data you can see that the columns are all strings; I want the ones containing monetary values to be numeric so that later on I can SUM them, average them, etc.

There are a number of steps to achieve this:

  • Getting rid of all the $ symbols. I achieved this by selecting all the columns, right-clicking on one of the headers and selecting the “Replace Values” option, specifying $ as the search value and leaving the replacement value empty.
  • Converting them into numbers – once again, select all the columns, right-click on a heading, then select Change Type and then Number.
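These two steps generate M along these lines: Table.ReplaceValue over every column, then a type change. The monetary column names below are placeholders for the real headings on the page:

    NoDollars = Table.ReplaceValue(Source, "$", "", Replacer.ReplaceText, Table.ColumnNames(Source)),
    // hypothetical column names – list the actual monetary columns here
    MoneyColumns = {"Travel", "Office Facilities"},
    Typed = Table.TransformColumnTypes(NoDollars, List.Transform(MoneyColumns, each {_, type number}))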

The next part of the transformation stage is some aesthetic cleansing: renaming the column headings. I just went through and renamed each one, removing the number and the extra whitespace.

As this set of data is going to be combined with data from other periods, I added a column to identify which period this data is from. Adding a new column is done by right-clicking the header row, then Insert Column and Custom. I named the column Period and gave it the value “1 January to 30 June 2012”.
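The custom column step is a one-liner in M (assuming the previous step is named RenamedColumns):

    WithPeriod = Table.AddColumn(RenamedColumns, "Period", each "1 January to 30 June 2012")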


That completes the transformation stage and all the work that needs to be done in the Query Editor.

Clicking Done brings all the data into a new Excel sheet.


Now you might be thinking: I could have just copied and pasted the data from the webpage straight into Excel and cut out this messing about. While that is true for a lot of data from the web, it misses one of the key features of PowerQuery.

If you look closely at the screenshot of Excel above, you will notice the Query Settings on the right-hand side, with a “Last updated” time and a “Refresh” button. When the refresh button is clicked, PowerQuery downloads the HTML document again and processes it using the steps defined above, removing the need for any manual clean-up.

It’s hassle free 🙂 and that is one of the reasons I like it.

In the next post I’m going to show how to pull in the data from the other periods so that in a future post I can do historical comparisons on expenditure.

(You can download the fully working spreadsheet from my dropbox)