Scraping the Web with Power Query

Last year I entered the PowerBI video demo contest. Whilst I didn’t win any prizes, I did learn a fair bit from going through the exercise of putting together a screencast demo (more on that another time). In this post I’m going to walk through the web scraping part of my demo.

The website that I chose to use for my demo was the National UFO Reporting Center (more for novelty’s sake than any serious interest). Below you can see the listing by shape of craft (www.nuforc.org/webreports/ndxshape.html).

1-index-by-shape

If I click on one of these links then the details of each individual sighting are listed:

2-sample-details

The overall goal is to retrieve all of the individual sightings from each page and store them in a Power Pivot table.

One thing which makes the exercise of scraping this site easy is that the index page links to each of the individual pages in a consistent manner:

http://www.nuforc.org/webreports/ndxs{SHAPE NAME HERE}.html

This means the process of scraping will follow these steps:

  1. Retrieve the list of shapes from the index page and store these
  2. Loop through each of the shapes from step 1:
    1. Dynamically generate the URI based on the shape name
    2. Retrieve the table of sightings

Retrieving the Shape index table

From Excel select the Power Query tab, then click the From Web button.

2a-from-web

This will bring up a dialog asking for the URL of the webpage. Enter the URL for the shape index: http://www.nuforc.org/webreports/ndxshape.html

Doing this used to take you directly into the Power Query editor, but now there is an extra step. In the right-hand side of Excel you will see a Navigator appear:

3-navigator

This shows all the different elements that Power Query discovered on the web page. Table 0 contains the list of all the shapes, which you can see in the preview when you hover your mouse over it. Click Table 0, then click Edit to bring up the Query Editor.
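
Behind the scenes, the From Web button generates M code along these lines (a sketch; the step names here are illustrative, the generated ones will differ):

let
    // Download and parse the page; Web.Page returns one row per
    // element (table) that Power Query finds in the HTML
    Source = Web.Page(Web.Contents("http://www.nuforc.org/webreports/ndxshape.html")),
    // Table 0 is the first table on the page - the shape index
    ShapeIndex = Source{0}[Data]
in
    ShapeIndex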

At this point all I did was a few clean up tasks:

  • Removed the Count column
  • Renamed “Reports” to “Shape”
  • Renamed the Query to “Shape List”

I also un-ticked “Load to Worksheet”; you will see why shortly.
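
For reference, those clean-up tasks correspond to M code roughly like this (a sketch; step names are illustrative):

let
    Source = Web.Page(Web.Contents("http://www.nuforc.org/webreports/ndxshape.html")),
    ShapeIndex = Source{0}[Data],
    // Remove the Count column
    RemovedCount = Table.RemoveColumns(ShapeIndex, {"Count"}),
    // Rename "Reports" to "Shape"
    ShapeList = Table.RenameColumns(RemovedCount, {{"Reports", "Shape"}})
in
    ShapeList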

4-query-editor-shape-list

Retrieving individual sightings

The next step is to parse the individual sighting pages, which I’ll do in two stages. First, I’ll take a single sighting page and use it to work out the transformation and clean-up tasks. Second, I’ll show you how to turn that query into a function which takes a shape name and uses it to dynamically generate the URL for retrieval; this function will then be used in the next section.

For my example I will be using the Cone shaped sightings (www.nuforc.org/webreports/ndxsCone.html).

Once again use the From Web button and enter the above URL, then in the navigator select Table 0 and then click Edit to open the Query Editor.

I performed the following clean up tasks:

  • Removed “Posted” and “Shape” columns
  • Used the Split Column transform on the “Date / Time” (delimited by Space, at the left most)
  • Renamed the new columns to “Date” and “Time”
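
In M, those three steps look roughly like this (a sketch; the step names the UI generates will differ):

let
    Source = Web.Page(Web.Contents("http://www.nuforc.org/webreports/ndxsCone.html")),
    Sightings = Source{0}[Data],
    // Drop the columns we don't need
    RemovedColumns = Table.RemoveColumns(Sightings, {"Posted", "Shape"}),
    // Split "Date / Time" at the left-most space into Date and Time
    SplitDateTime = Table.SplitColumn(RemovedColumns, "Date / Time",
        Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.None, false),
        {"Date", "Time"})
in
    SplitDateTime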

The dates are in a format that Power Query is unable to recognize, so I had to perform some extra transformation steps to reformat them:

  • Used the Split Column transform on the “Date” column (delimited by custom, /, at each occurrence). This will split out each date part.
  • Changed the data type of these new columns to Text
  • Inserted a custom column with the following formula:

Date.From(Text.Combine({[Date.3],[Date.1],[Date.2]}, "/"))

  • Removed the “Date.1”, “Date.2”, “Date.3” columns
  • Moved the Date column to be in the first position
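
Those date-repair steps translate to M along these lines (a sketch; PreviousStep stands for the table produced by the clean-up above, and the real generated step names will differ):

// Split the Date column into its three parts at each "/"
SplitDate = Table.SplitColumn(PreviousStep, "Date",
    Splitter.SplitTextByDelimiter("/"), {"Date.1", "Date.2", "Date.3"}),
TypedText = Table.TransformColumnTypes(SplitDate,
    {{"Date.1", type text}, {"Date.2", type text}, {"Date.3", type text}}),
// Re-assemble the parts as year/month/day so Date.From can parse them
AddedDate = Table.AddColumn(TypedText, "Date",
    each Date.From(Text.Combine({[Date.3], [Date.1], [Date.2]}, "/"))),
RemovedParts = Table.RemoveColumns(AddedDate, {"Date.1", "Date.2", "Date.3"}),
Reordered = Table.ReorderColumns(RemovedParts,
    {"Date", "Time", "City", "State", "Duration", "Summary"})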

5-cone-after-cleanup

The next step is to turn this into a reusable formula. Bring up the Advanced Editor from the View tab in the ribbon.

5a-advanced-editor

6-before-function

To turn this into a function replace the first two rows with the following code:

let
fnGetByShape = (shapeName as text) =>
let
Source = Web.Page(Web.Contents(Text.Replace("http://www.nuforc.org/webreports/ndxs{shape}.html", "{shape}", shapeName))),

and insert the following at the end:

in
fnGetByShape

These two code snippets will turn the query into a function. Click Done to close the Advanced Editor and you will see that the table of data that was there is now replaced with this:

7-now-a-function

You can test this out by clicking the Invoke button and entering a shape name (e.g. Cone).
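
Put together, the whole query now looks something like this (abbreviated; the clean-up and date-repair steps from earlier go where the comment indicates):

let
    fnGetByShape = (shapeName as text) =>
    let
        // Substitute the shape name into the URL template and fetch the page
        Source = Web.Page(Web.Contents(Text.Replace(
            "http://www.nuforc.org/webreports/ndxs{shape}.html",
            "{shape}", shapeName))),
        Sightings = Source{0}[Data]
        // ... clean-up and date-repair steps from earlier go here ...
    in
        Sightings
in
    fnGetByShape

Invoking it as fnGetByShape("Cone") should return the same table of Cone sightings we built by hand above.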

The last thing left to do is give the query a good name (fnGetSightingsByShape) and ensure that both the Load to Worksheet and Load to Data Model options are un-ticked.

Using the function

Now that both the list of shapes query and the function to retrieve each individual sighting are ready, it is time to combine them together and load the data into Power Pivot.

Open the Shape List query that was created earlier and add a custom column:

8-inserting-the-function

This will add a new column which calls the custom function. Expanding the Table in the new column will show the columns from the function (Date, Time, City, State, Duration, Summary). I had to set the data types for the Date and Time columns.
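
The combined query ends up roughly like this (a sketch; #"Shape List" is the query created earlier and fnGetSightingsByShape is the function):

let
    Source = #"Shape List",
    // Call the function once per shape; each call returns a table of sightings
    AddedSightings = Table.AddColumn(Source, "Sightings",
        each fnGetSightingsByShape([Shape])),
    // Flatten the nested tables into one big table
    Expanded = Table.ExpandTableColumn(AddedSightings, "Sightings",
        {"Date", "Time", "City", "State", "Duration", "Summary"}),
    // Set the data types for the Date and Time columns
    Typed = Table.TransformColumnTypes(Expanded,
        {{"Date", type date}, {"Time", type time}})
in
    Typed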

Now the only thing left to do is tick “Load to Data Model”, then go and get a cup of tea or coffee while Power Query loads the data into Power Pivot.

In Summary

In this post I’ve shown you how you can use Power Query to scrape data from a website. I’ve also shown you how to build reusable functions which can be used to scrape different levels of pages on a website.

I’ve uploaded my finished demo to Dropbox for your enjoyment.


Scraping the Web with Excel 2013 – PowerBI Competition Entry

The Microsoft PowerBI Competition is now in full swing, with voting open to the public for the next week (check out my entry).

As you can see below I just made my submission in time.

I like to cut it fine!

When I came to building my demo (check it out) I had a few different data sets in mind, but there were two main points that I wanted to highlight with my entry:

  1. How Power Query makes scraping webpages and extracting data into Excel really easy
  2. How Power View can be used to build visually stunning and interactive reports.

I ended up going with the National UFO Reporting Center (http://www.nuforc.org/webreports.html). They have very helpfully made their database available in plain old HTML tables, which makes it rather trivial to retrieve using Power Query.

The biggest takeaway I had from doing this is that screencasts are difficult! I have a newfound appreciation for the Pluralsight authors, who do an amazing job of lining up the timing of doing something on screen with narrating it.