Scraping the Web with Power Query

Last year I entered the PowerBI video demo contest. Whilst I didn’t win any prizes I did learn a fair bit from going through the exercise of putting together a screencast demo (more on that another time). In this post I’m going to walk-through the web scraping part of my demo.

The website that I choose to use for my demo was the National UFO Reporting Center (more for novelty sake then any serious interest). Below you can see the listing by shape of craft (www.nuforc.org/webreports/ndxshape.html).

1-index-by-shape

If I click on one of these links then details on each individual sighting is listed

2-sample-details

The overall goal is to have all of the individual sightings retrieved from each individual page and stored into a Power Pivot table.

One thing which makes the exercise of scraping this site easy is that the index page links to each of the individual pages in a consistent manor:

http://www.nuforc.org/webreports/ndxs{SHAPE NAME HERE}.html

This means the process of scraping will follow these steps:

  1. Retrieve the list of shapes from the index page and store these
  2. Loop through each of the shapes from step 1:
    1. Dynamically generate the URI based on the shape name
    2. Retrieve the table of sightings

Retrieving the Shape index table

From Excel select the Power Query tab, then click the From Web button.

2a-from-web

This will bring up a dialog asking for the URL to the webpage, enter the URL for the shape index: http://www.nuforc.org/webreports/ndxshape.html

Doing this used to take you directly into the Power Query editor, but now there is an extra step. In the right-hand side of Excel you will see a Navigator appear:

3-navigator

This is showing all the different elements that Power Query discovered on the web page, Table 0 contains a list of all the shapes, which you can see in the preview when you hover your mouse over it. Click Table 0 then click Edit to bring up the Query Editor.

At this point all I did was a few clean up tasks:

  • Removed the Count column
  • Renamed “Reports” to “Shape”
  • Renamed the Query to “Shape List”

I also un-ticked “Load to Worksheet”, as you will see why shortly.

4-query-editor-shape-list

Retrieving individual sightings

The next step is to parse the individual sighting pages, I’ll do this in two steps. In the first step I’ll just take one of the sighting pages and using it perform any transformations or clean up tasks, then in the second step I will show you how to turn this into a formula which will take a shape name, which will be used to dynamically generate a URL for retrieval, this formula will then be used in the next section.

For my example I will be using the Cone shaped sightings (www.nuforc.org/webreports/ndxsCone.html).

Once again use the From Web button and enter the above URL, then in the navigator select Table 0 and then click Edit to open the Query Editor.

I performed the following clean up tasks:

  • Removed “Posted” and “Shape” columns
  • Used the Split Column transform on the “Date / Time” (delimited by Space, at the left most)
  • Renamed the new columns to “Date” and “Time”

The dates are in a format that Power Query is unable to recognize so I had to perform some extra transformation steps to reformat them:

  • Used the Split Column transform on the “Date” column (delimited by custom, /, at each occurrence). This will split out each date part.
  • Changed the data type of these new columns to Text
  • Inserted a custom column with the following formula:

Date.From(Text.Combine({[Date.3],[Date.1],[Date.2]}, “/”))

  • Removed the “Date.1”, “Date.2”, “Date.3” columns
  • Moved the Date column to be in the first position

5-cone-after-cleanup

The next step is to turn this into a reusable formula. Bring up the Advanced Editor from the View tab in the ribbon.

5a-advanced-editor

6-before-function

To turn this into a function replace the first two rows with the following code:

let
fnGetByShape = (shapeName as text) =>
let
Source = Web.Page(Web.Contents(Text.Replace(“http://www.nuforc.org/webreports/ndxs{shape}.html”, “{shape}”, shapeName))),

 

and insert the following at the end

in
fnGetByShape

These two code snippets will turn the query into a function. Click Done to close the Advanced Editor you will see that the the table of data that was there is now replaced with this:

7-now-a-function

You can test this out by clicking the Invoke button and entering in a shape name (e.g. Cone).

The last thing left to do is give the query a good name (fnGetSightingsByShape) and ensure that both the Load to Worksheet and Load to Data Model options are un-ticked.

Using the function

Now that both the list of shapes query and the function to retrieve each individual sighting are ready, it is time to combine them together and load the data into Power Pivot.

Open the Shape List query that was created earlier and add a custom column:

8-inserting-the-function

This will add a new column which calls the custom function. Expanding the Table in the new column will show the columns from the function (Date, Time, City, State, Duration, Summary). I had to set the data types for the Date and Time columns.

Now the only thing left to do is tick “Load to Data Model” then get a cuppa tea or coffee while Power Query loads the data into Power Pivot.

In Summary

In this post I’ve shown you how you can use Power Query to scrape data from a website, in addition I’ve also shown you how to build reusable functions which can be used to scrape different levels of pages on a website.

I’ve uploaded my finished demo to Dropbox for your enjoyment.

Advertisements

17 thoughts on “Scraping the Web with Power Query

  1. Matt says:

    Hi, I followed your steps successfully until the ‘Using the function’ section towards the end. What does “Expanding the Table” mean and how do you do it? Basically, I would be grateful if you could separate out this section into a few baby steps like you did in previous sections. Many thanks…

    • KenR says:

      I haven’t had a chance to take some screenshots yet, but basically what you need to do is click the column header, a little icon will appear, click the icon and a little window will appear asking you to select which columns to expand.

  2. Chris says:

    Hi.
    This is great, thank you!

    I too am having trouble with that last step of putting the function in a custom column in the index. I followed your steps as close as I could; however, it shows error in each line item instead of Table. I tested my custom function by invoking it with one address, and it does work.. Any ideas?

  3. ChrisG says:

    Expression.Error: Cannot convert a value of type Record to type Text.
    Details:
    Value=Record
    Type=Type

    I too get an error on the last step and can invote the shapes individually. So i think we have the same problem

      • Nemi Podus says:

        Hi,

        I know that this is an old post but am giving it a shot anyway…
        Your guide can help me tremendously if i can figure out the error above.
        i too followed everything, and was able to invoke lists when testing individually.
        but plugging in the formula gives –

        Expression.Error: We cannot convert a value of type Table to type Function.
        Details:
        Value=Table
        Type=Type

        thank you!

  4. ChrisG says:

    I got it to work – i screwed up the clams. I got another problem though – the “unspecified” shape category has another URL then the rest. How do we take that into account? Rest of the shape URL’s has the shape type in the end – so the function works. But the Unspecified does not have a “shape” in the end of the URL:

    Unspecified URL: “http://www.nuforc.org/webreports/ndxs.html”

    Cone URL: “http://www.nuforc.org/webreports/ndxsCone.html”

    • ChrisG says:

      Found a solution. If you in custom table add this function:

      if[Shape] = “Unspecified” then fnGetSightingsByShape(“”)
      else fnGetSightingsByShape([Shape])

      Instead of just “fnGetSightingsByShape([Shape])”

  5. compania says:

    Hi,

    Thanks for the post, it helps me to understand Power Query better.
    Hi,

    Thanks for the post. It helps me understand better Power Query.

    I have an annoying error on the function part.
    After deleting the first two rows, it shows me an error :”invalid identifier” on the (“http…”) part.
    here is what I did.

    let
    fnGetByShape = (shapeName as text) =>
    let
    Source = Web.Page(Web.Contents(Text.Replace(“http://www.nuforc.org/webreports/ndxs{shape}.html”, “{shape}”, shapeName))),
    Data0 = Source{0}[Data],

    what seems to be the problem ?

      • compania says:

        Sure

        let
        fnGetByShape = (shapeName as text) =>
        let
        Source = Web.Page(Web.Contents(Text.Replace(“http://www.nuforc.org/webreports/ndxs{shape}.html”, “{shape}”, shapeName))),
        #”Changed Type” = Table.TransformColumnTypes(Data0,{{“Date / Time”, type datetime}, {“City”, type text}, {“State”, type text}, {“Shape”, type text}, {“Duration”, type text}, {“Summary”, type text}, {“Posted”, type date}}),
        #”Removed Columns” = Table.RemoveColumns(#”Changed Type”,{“Shape”, “Posted”}),
        #”Split Column by Delimiter” = Table.SplitColumn(Table.TransformColumnTypes(#”Removed Columns”, {{“Date / Time”, type text}}, “en-US”),”Date / Time”,Splitter.SplitTextByEachDelimiter({” “}, QuoteStyle.Csv, false),{“Date / Time.1”, “Date / Time.2″}),
        #”Changed Type1″ = Table.TransformColumnTypes(#”Split Column by Delimiter”,{{“Date / Time.1”, type date}, {“Date / Time.2″, type time}}),
        #”Renamed Columns” = Table.RenameColumns(#”Changed Type1″,{{“Date / Time.1”, “Date”}, {“Date / Time.2”, “Time”}}),
        #”Split Column by Delimiter1″ = Table.SplitColumn(Table.TransformColumnTypes(#”Renamed Columns”, {{“Date”, type text}}, “en-US”),”Date”,Splitter.SplitTextByDelimiter(“/”, QuoteStyle.Csv),{“Date.1”, “Date.2”, “Date.3″}),
        #”Changed Type2″ = Table.TransformColumnTypes(#”Split Column by Delimiter1”,{{“Date.1”, type text}, {“Date.2”, type text}, {“Date.3″, type text}}),
        #”Added Custom” = Table.AddColumn(#”Changed Type2″, “Date”, each Date.From(Text.Combine({[Date.3],[Date.1],[Date.2]}, “/”))),
        #”Reordered Columns” = Table.ReorderColumns(#”Added Custom”,{“Date”, “Date.1”, “Date.2”, “Date.3”, “Time”, “City”, “State”, “Duration”, “Summary”}),
        #”Removed Columns1″ = Table.RemoveColumns(#”Reordered Columns”,{“Date.1”, “Date.2”, “Date.3”})
        in
        fnGetByShape

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s