Scraping the Web with Power Query

Last year I entered the PowerBI video demo contest. Whilst I didn’t win any prizes I did learn a fair bit from going through the exercise of putting together a screencast demo (more on that another time). In this post I’m going to walk-through the web scraping part of my demo.

The website that I choose to use for my demo was the National UFO Reporting Center (more for novelty sake then any serious interest). Below you can see the listing by shape of craft (


If I click on one of these links then details on each individual sighting is listed


The overall goal is to have all of the individual sightings retrieved from each individual page and stored into a Power Pivot table.

One thing which makes the exercise of scraping this site easy is that the index page links to each of the individual pages in a consistent manor:{SHAPE NAME HERE}.html

This means the process of scraping will follow these steps:

  1. Retrieve the list of shapes from the index page and store these
  2. Loop through each of the shapes from step 1:
    1. Dynamically generate the URI based on the shape name
    2. Retrieve the table of sightings

Retrieving the Shape index table

From Excel select the Power Query tab, then click the From Web button.


This will bring up a dialog asking for the URL to the webpage, enter the URL for the shape index:

Doing this used to take you directly into the Power Query editor, but now there is an extra step. In the right-hand side of Excel you will see a Navigator appear:


This is showing all the different elements that Power Query discovered on the web page, Table 0 contains a list of all the shapes, which you can see in the preview when you hover your mouse over it. Click Table 0 then click Edit to bring up the Query Editor.

At this point all I did was a few clean up tasks:

  • Removed the Count column
  • Renamed “Reports” to “Shape”
  • Renamed the Query to “Shape List”

I also un-ticked “Load to Worksheet”, as you will see why shortly.


Retrieving individual sightings

The next step is to parse the individual sighting pages, I’ll do this in two steps. In the first step I’ll just take one of the sighting pages and using it perform any transformations or clean up tasks, then in the second step I will show you how to turn this into a formula which will take a shape name, which will be used to dynamically generate a URL for retrieval, this formula will then be used in the next section.

For my example I will be using the Cone shaped sightings (

Once again use the From Web button and enter the above URL, then in the navigator select Table 0 and then click Edit to open the Query Editor.

I performed the following clean up tasks:

  • Removed “Posted” and “Shape” columns
  • Used the Split Column transform on the “Date / Time” (delimited by Space, at the left most)
  • Renamed the new columns to “Date” and “Time”

The dates are in a format that Power Query is unable to recognize so I had to perform some extra transformation steps to reformat them:

  • Used the Split Column transform on the “Date” column (delimited by custom, /, at each occurrence). This will split out each date part.
  • Changed the data type of these new columns to Text
  • Inserted a custom column with the following formula:

Date.From(Text.Combine({[Date.3],[Date.1],[Date.2]}, “/”))

  • Removed the “Date.1”, “Date.2”, “Date.3” columns
  • Moved the Date column to be in the first position


The next step is to turn this into a reusable formula. Bring up the Advanced Editor from the View tab in the ribbon.



To turn this into a function replace the first two rows with the following code:

fnGetByShape = (shapeName as text) =>
Source = Web.Page(Web.Contents(Text.Replace(“{shape}.html”, “{shape}”, shapeName))),


and insert the following at the end


These two code snippets will turn the query into a function. Click Done to close the Advanced Editor you will see that the the table of data that was there is now replaced with this:


You can test this out by clicking the Invoke button and entering in a shape name (e.g. Cone).

The last thing left to do is give the query a good name (fnGetSightingsByShape) and ensure that both the Load to Worksheet and Load to Data Model options are un-ticked.

Using the function

Now that both the list of shapes query and the function to retrieve each individual sighting are ready, it is time to combine them together and load the data into Power Pivot.

Open the Shape List query that was created earlier and add a custom column:


This will add a new column which calls the custom function. Expanding the Table in the new column will show the columns from the function (Date, Time, City, State, Duration, Summary). I had to set the data types for the Date and Time columns.

Now the only thing left to do is tick “Load to Data Model” then get a cuppa tea or coffee while Power Query loads the data into Power Pivot.

In Summary

In this post I’ve shown you how you can use Power Query to scrape data from a website, in addition I’ve also shown you how to build reusable functions which can be used to scrape different levels of pages on a website.

I’ve uploaded my finished demo to Dropbox for your enjoyment.

Debugging Custom SSIS Components

In the past I have written extensively about how to build custom components for SQL Server Integration Services, these posts have always been focused on the ‘happy path’, if you’re not familiar with this phrase it refers to the path through your application that works exactly as expected. Often times in development we have to deal with the sad path, or when things aren’t working as we would like or expect.

In this post I will show two techniques that can be used to help with debugging custom SSIS components.

Attaching the Visual Studio Debugger

You will need to have Visual Studio open with the source code for the custom component and have SSDT open with the package that uses the custom component.

From Visual Studio select the Debug menu then Attach to Process


This will bring up the following window showing a list of available processes for debugging:


In the list of available processes select the instance of devenv.exe that corresponds to your SSIS solution, then click Attach. Once attached any breakpoints that you define will be hit, from where you can debug your component.

This method works just fine for when you want to debug the setup and configuration of your package, that is, anything the users of your component will be doing prior to executing the package.

The disadvantage of this method is that once you actually start executing the package the debugger will detach itself. The reason for this is because execution of the package happens within a different process. Continue reading for a better method.

Launching the Debugger from in Code

Another way to automatically start debugging is to add the following line of code:


Into the particular method that you want to debug or into the constructor.


Here you can see I’ve added the Debugger.Launch() code into the constructor for my HDFS Task Component, Now when I open a package in SSDT that uses this component, or if I drag a new instance of the task into an existing package the following window will appear:


Here I can select how I want to debug the application. During development of custom components I always have an instance of visual studio open with my component solution and you can see it listed as the third option in the image above. By selecting this option and clicking Yes I will be taken back to that instance of Visual Studio which will then attach itself to my instance of SSDT. It will then break execution.


At this point I can set any breakpoints where I want to break future execution. From here press F5 to continue. Visual Studio will remain attached to SSDT and any breakpoints you’ve defined will interrupt execution and bring you back here.

The disadvantage of this method is that the “Do you want to debug this application” window will keep appearing at various points during package design unless you attach the debugger.

The advantage of using this method is that once you actually start executing your package you will again be prompted to attach the debugger, which means you will be able to debug the component during package execution.

Make sure you either remove this line of code before shipping it for actual use or wrap it around the #if DEBUG / #endif compiler directives as shown above and remember to use the Release configuration for production use!

Apache Flume – Get logs out of RabbitMQ and into HDFS

This post is an extension of Tutorial 12 from Hortonworks (original here), which shows how to use Apache Flume to consume entries from a log file and put them into HDFS.

One of the problems that I see with the Hortonworks sandbox tutorials (and don’t get me wrong, I think they are great) is the assumption that you already have data loaded into your cluster, or they demonstrate an unrealistic way of loading data into your cluster – uploading a csv file through your web browser. One of the exceptions to this is tutorial 12, which shows how to use Apache Flume to monitor a log file and insert the contents into HDFS.

In this post I’m going to further extend the original tutorial to show how to use Apache Flume to read log entries from a RabbitMQ queue.

Apache Flume is described by the folk at Hortonworks as:

Apache™ Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows; and is robust and fault tolerant with tunable reliability mechanisms for failover and recovery.

…continue reading about Apache Flume over on the Hortonworks website.


In this article I will cover off the following:

  • Installation and Configuration of Flume
  • Generating fake server logs into RabbitMQ

To follow along you will need to:

Installing Flume

WIth the Sandbox up and running, press Alt and F5 to bring up the login screen. You can login using the default credentials:

login: root
password: hadoop

After you’ve logged in type:

yum install -y flume

You should now see the installation progressing until it says Complete!

For more details on installation take a look at Tutorial 12 from Hortonworks.

Using the flume.conf file that is part of my tutorial files, follow the instructions to upload it into the sandbox from the tutorial. Before uploading the file, you should check that the RabbitMQ configuration matches your system:

sandbox.sources.rabbitmq_source1.hostname =
sandbox.sources.rabbitmq_source1.queuename = logs
sandbox.sources.rabbitmq_source1.username = guest
sandbox.sources.rabbitmq_source1.password = guest
sandbox.sources.rabbitmq_source1.port = 5672
sandbox.sources.rabbitmq_source1.virtualhost = logs

You shouldn’t need to change anything else.

For Flume to be able to consume from a RabbitMQ queue I created a new plugins directory and then upload the Flume-ng RabbitMQ library.

Creating the required directories can be done from the Sandbox console with the following command:

mkdir /usr/lib/flume/plugins.d
mkdir /usr/lib/flume/plugins.d/flume-rabbitmq
mkdir /usr/lib/flume/plugins.d/flume-rabbitmq/lib

Once these directories have been created, upload the flume-rabbitmq-channel-1.0-SNAPSHOT.jar file into the lib directory.

Starting Flume

From the Sandbox console, execute the following command

flume-ng agent -c /etc/flume/conf -f /etc/flume/conf/flume.conf -n sandbox

Generate server logs into RabbitMQ

To generate log entries I took the original python script (which appended entries to the end of a log file), and modified it to publish log entries to RabbitMQ.

To run the python script you will need to follow the instructions on the RabbitMQ site to install the pika client library (see details on the RabbitMQ website).

The script is setup to connect to a broker on the localhost into a virtual host called “logs”. You will need to make sure that the virtual host exists.

You can start the script by running:

python.exe c:\path\to\

When this is started the script will declare an exchange and queue and then start publishing log entries.

You can see that everything is running by going over the RabbitMQ Management console.

3-flume-consumingThe incoming messages are the ones being generated by the Python script and the deliver / get and ack messages are the ones being consumed by Flume.

Setting up HCatalog

The following command (from the original tutorial) can be used to create the HCatalog table (make sure you only enter it only on a single line):

hcat -e “CREATE TABLE firewall_logs (time STRING, ip STRING, country STRING, status STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘|’ LOCATION ‘/flume/rab_events’;”

You should now be able to browse this table from the web interface.

To do some analysis on this data you can now follow steps 5 and 6 from the original tutorial.