SQL Server Integration Services – An Updated HDFS Task

This is an update to my previous post about loading data into HDFS. After using the component a few times I realized that having to pass in a list of files to upload seemed a bit odd inside of a Data Flow task.

So instead I have changed the component to be an SSIS Task rather than a Destination. This means that it is used in the Control Flow of a package instead of in a Data Flow task.

I have also made a few other changes:

  • Added a file type filter
  • Improved the UI design
  • Added the ability to create a new HDFS Connection Manager
  • Added a UI to the HDFS Connection Manager

This is what the component now looks like:

[Screenshot: hdfs-task-ui]

The File Type Filter allows you to specify what types of files should be uploaded from the source directory specified. This is useful if you have a mixture of files or only want to upload a subset.

The update has been pushed to GitHub – https://github.com/kzhen/SSISHDFS.

Analysing Australian Parliamentarians’ Expenditure with PowerQuery

Currently in Australia there have been some news reports of Federal MPs abusing their expenses…

So I thought I would take a look at what publicly available data there is and then use the Microsoft Excel Power* (PowerQuery, PowerPivot, PowerView) tools to do some analysis.

This post is about using PowerQuery to pull data from the Department of Finance website into Excel and get it into shape to be used with PowerPivot.

I will be using the Department of Finance website (linked here). If you browse to this page, you will see a series of HTML tables that contain the expenditure, broken down by type, for each minister.

Using PowerQuery it is very easy to download this data, transform and shape it nicely, and then have it automagically appear in Excel.

So here is how I did it using Excel 2013.

First I selected the PowerQuery tab in the Excel ribbon, then clicked on the “From Web” button. This will bring up a dialog box asking for a URL. I started with the link above and pasted that in.

[Screenshot: 1-excel-ribbon-url]

That loads up the PowerQuery Query Editor, which is where all of the transformation happens.

[Screenshot: 2-query-editor]

Things to note about this screen:

  • Down the left-hand side you have the Navigator – this shows all the HTML tables that PowerQuery “discovered”
  • In the middle of the screen you have the data that will be loaded into Excel when you press the “Done” button (currently blank)
  • On the right-hand side you have the steps that have been performed (this is expanded using the little arrow, there is also a handy little count of how many steps have been performed)

There were a number of steps that I went through to get the data ready for analysis.

The first was setting up the source. Clicking “New South Wales” in the Navigator section displays all the data for ministers from New South Wales in the main grid, but I want to combine that with ministers from ALL states and territories.

To combine multiple tables together:

  • Click on New South Wales
  • Expand the Steps
  • Click on Source – notice that the formula bar changes to say

= Source{0}[Data]

  • This can be replaced with

= Source{0}[Data]&Source{1}[Data]&Source{2}[Data]&Source{3}[Data]&Source{4}[Data]&Source{5}[Data]&Source{6}[Data]&Source{7}[Data]

After performing this step, I click the Refresh button, do some scrolling, and I can see that ministers from Melbourne, Brisbane and other cities outside New South Wales are now in the grid.

[Screenshot: 3-combined-html-tables]
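
As an aside, I believe the same combination can be written a little more compactly using the Table.Combine function, which takes a list of tables (this is just an alternative form – the & version above works fine too):

= Table.Combine({Source{0}[Data], Source{1}[Data], Source{2}[Data], Source{3}[Data], Source{4}[Data], Source{5}[Data], Source{6}[Data], Source{7}[Data]})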

Right, so now that the source has been set up it’s time for the second stage – the transformation stage.

Looking through the data you can see that the columns are all strings. I want the ones containing monetary values to be numeric so that later on I can SUM them, average them, etc.

There are a number of steps to achieve this:

  • Getting rid of all the $ symbols – I achieved this by selecting all the columns, right-clicking on one of the headers and selecting the “Replace Values” option. In here I’m specifying $ as the search value and leaving the replacement value empty.
  • Now to convert them into a number – once again, select all the columns, right-click on a heading, then select Change Type and then Number (the formulas these two steps generate are sketched just below).
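
For reference, the formulas that the Query Editor generates for these two steps look roughly like the following (the step and column names here are placeholders – your actual previous step name and column headings will differ):

= Table.ReplaceValue(Source, "$", "", Replacer.ReplaceText, {"Domestic Travel", "Overseas Travel"})
= Table.TransformColumnTypes(ReplacedValue, {{"Domestic Travel", type number}, {"Overseas Travel", type number}})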

The next part of the transformation stage is to do some aesthetic cleansing, namely renaming the column headings. I just went through and renamed each one, removing the number and the extra whitespace.
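
The generated formula for this step is along these lines (the old and new names below are just examples of the pattern, not the real headings):

= Table.RenameColumns(ChangedType, {{"1. Domestic travel", "Domestic Travel"}, {"2. Overseas travel", "Overseas Travel"}})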

As this set of data is going to be combined with data from other periods, I added a column to identify what period this data is from. Adding a new column is done by right-clicking the header row then Insert Column and Custom. I named the column Period and gave it the value “1 January to 30 June 2012”.
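
In formula terms this step comes out as something like the following (RenamedColumns stands in for whatever your previous step happens to be called):

= Table.AddColumn(RenamedColumns, "Period", each "1 January to 30 June 2012")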

[Screenshot: 4-completed-transformation]

That completes the transformation stage and all the work that needs to be done in the Query Editor.

Clicking Done brings all the data into a new Excel sheet.

[Screenshot: 5-data-in-excel]

Now you might be thinking: well, I could have just copied and pasted the data from the webpage straight into Excel and cut out this messing about. While that is true for a lot of data from the web, it misses out on one of the key features of PowerQuery.

If you look closely at the screenshot of Excel above, you will notice the Query Settings pane on the right-hand side; in it there is a “Last updated” time and a “Refresh” button. When the Refresh button is clicked, PowerQuery will download the HTML document again and process it using the steps defined above, removing the need for any manual cleanup.

It’s hassle free 🙂 and that is one of the reasons I like it.

In the next post I’m going to show how to pull in the data from the other periods so that in a future post I can do historical comparisons on expenditure.

(You can download the fully working spreadsheet from my dropbox)

SQLRally Nordic – What I’m Looking Forward To

This year I have been fortunate enough to have been able to attend a few different SQL Server community events.

Back in May I attended SQLBits XI, which was a fantastic experience and opened my eyes to the #SQLFamily. I learnt a lot about SQL Server, but I think what I really learnt is that there is an amazing online community for SQL Server professionals.

Then a couple of weeks ago I was at #sqlsat228 in Cambridge, where I attended a Friday pre-con on Big Data with Jen Stirrup and Allan Mitchell followed by a full day of sessions on the Saturday.

At the beginning of November I’m going to Stockholm to attend SQL Rally Nordic. So this post is about which sessions I’m planning to attend.

Pre-conference Seminar

Monday: Data Warehouse Modeling – Making the Right Choices

I’ve learnt a lot about data modelling, especially over the past year, so this session is coming at a good time.

It sounds like it is going to be a full-on day. I’m planning on writing at least one or two blog posts to solidify what I learn, so stay tuned.

Conference Sessions

Tuesday

Time        Track   Session
1030-1130   BI      Optimizing Data Models in Tabular & PowerPivot
1145-1245   BI      10 Tips and Tricks for Better SSIS Performance (I attended this talk at SQLSaturday Cambridge)
1145-1245   DEV     Powerful T-SQL Improvements that Reduce Query Complexity
1400-1500   DBA     Getting the Most from Your SAN – File and Filegroup layout
1515-1615   BI      Optimizing DAX Queries
1645-1745   DBA     Index Impact Workload Analysis

Wednesday

Time        Track   Session
0900-1000   BI      Analysing Data with PowerView
1015-1115   BI      Data Warehouse in the Cloud – Marketing or Reality?
1015-1115   DBA     The Day After Tomorrow: Why You Need a Baseline
1145-1245   DEV     Cloud Optimizing Your Database App
1400-1500   BI      Use the Power of Analytical Hierarchies in Your Cubes
1515-1615   DBA     Architecting SQL Server HA and DR solutions on Windows Azure

As you can see, I’m planning to attend a number of sessions that are directly or indirectly related to the SSAS Tabular model (the DAX, PowerPivot or PowerView sessions).

Of late I have been using PowerPivot to build quick proof-of-concept BI models to get people’s attention. What I really like about PowerPivot is how quick and easy it is to build a data model, write some simple DAX formulas then use PowerView to show people the story of the data.

As well as attending some quality sessions, I’m also really looking forward to catching up with new and old friends.

So I hope to see you there 🙂

Load Files into HDFS using SQL Server Integration Services

UPDATE: I’ve made a few changes to how the component works – read about it here.

Recently I have been playing around with the Hortonworks Big Data sandbox. The tutorials were very good and made it easy to get going with some sample data.

Given that I mainly use the Microsoft BI stack, I was hoping to be able to use SSIS to load my data, especially as it would be nice down the line to do this as part of an ETL process. However, Microsoft don’t yet have first-class support for HDFS as a destination.

So to fix this problem I decided to build a custom destination for SSIS that would load data into HDFS.

My goal was to be able to transfer a bunch of files from my filesystem into HDFS.

This is my directory of files:

[Screenshot: 1-source-directory]

The way my custom destination currently works is by sending a list of filenames into it; these files are then transferred into HDFS. Here is my source file with the list of files to upload.

[Screenshot: 2-reference-file]

Next up is the SSIS package. As you can see below, the package is very simple. There is a flat file source, which I’ve set up to read from the source file above; it then feeds into the HDFS Destination.

The HDFS Destination has three parts to configure:

  • Connection Manager
  • Destination Path – this will be the path in HDFS to upload the files to (I had to change the permissions so all groups had write access)
  • Source column – this is the column from the flat file source to read the filenames from

[Screenshot: 4-ssdt-package]

The HDFS Connection Manager is set up with a hostname and port; I’ve set mine to “sandbox” and 50070, where sandbox is an entry in my hosts file pointing to my Hortonworks sandbox (192.168.56.101).

[Screenshot: 4b-hdfs-connection-manager]

After all this has been configured the SSIS package can be run.

[Screenshot: 5b-ssdt-after-running]

Now, to check HDFS:

[Screenshot: 6-hdfs-with-files]

All the files have been uploaded into HDFS.

The custom component uses the Microsoft Hadoop WebClient, which is part of the Microsoft .NET SDK for Hadoop. The code is available on GitHub.

That wraps up the first incarnation of my HDFS Destination. I’m planning to further extend it, so that rather than sending in a list of filenames to upload you will be able to send in string data which will then be transferred as files into HDFS.