Load Files into HDFS using SQL Server Integration Services

UPDATE: I’ve made a few changes to how the component works –

Read about it here.

Recently I have been playing around with the Hortonworks Big Data sandbox, the tutorials were very good, and made it easy to get going with some sample data.

Given that I mainly use the Microsoft BI stack I was hoping for the ability to use SSIS to load my data, especially as it would be nice down the line to do this as part of an ETL process. However, Microsoft as yet don’t have first class support for HDFS as a destination.

So to fix this problem I decided to build a custom destination for SSIS that would load data into HDFS.

My goal was to be able to transfer a bunch of files from my filesystem into HDFS.

This is my directory of files

1-source-directory

The way my custom destination currently works is by sending a list of filenames into it, these files will then be transferred into HDFS. Here is my source file with a list of files to upload.

2-reference-file

Next up is the SSIS package. As you can see below, the package is very simple. There is a flat file source, which I’ve setup to read from the source file above, it then feeds into the HDFS Destination.

The HDFS Destination has three parts to configure:

  • Configuration Manager
  • Destination Path – this will be the path in HDFS to upload the files to (I had to change the permissions so all groups had write access)
  • Source column – this is the column from the flat file source to read the filenames from

4-ssdt-package

The HDFS Configuration Manager is setup with a hostname and port, I’ve set mine to “sandbox” and 50070. Where sandbox is setup in my hosts file to point to my Hortonworks sandbox (192.168.56.101).

4b-hdfs-connection-manager

After all this has been configured the SSIS package can be run.

5b-ssdt-after-running

Now, to check HDFS

6-hdfs-with-files

All the files have been uploaded into HDFS.

The custom component uses the Microsoft Hadoop Webclient, which is part of the Microsoft Hadoop .Net SDK. The code is available on GitHub.

That wraps up the first incarnation of my HDFS Destination. I’m planning to further extend it, so that rather than sending in a list of filenames to upload you will be able to send in string data which will then be transferred as files into HDFS.

Advertisements

11 thoughts on “Load Files into HDFS using SQL Server Integration Services

  1. workmama2319 says:

    Ken, thanks for your time creating this post! I am trying to do the same thing using the Hortonworks Sandbox 2.0 and I can’t figure out how to connect to my sandbox. I am new to Hadoop and I am trying to upload some files into HDFS.

    Your example shows your connection string as “http://sandbox:9001” using port 50070. Is port 9001 a default port for the file system? What file are you editing in the sandbox to set your host name to “sandbox”? Can you expand a little on your sandbox VM settings to help me understand what I am missing?

    Thanks!! Matt.

    • KenR says:

      Hi Matt,

      Thanks for pointing out the inconsistency of the ports. I have just done a quick test and the port that I used was 50070.

      The hosts file that I refer to is the one on your local windows system (c:\windows\system32\drivers\etc\hosts). In mine I’ve added the following entry:

      192.168.56.101 sandbox

      Where 192.168.56.101 is the IP address for my sandbox, for information on setting up VirtualBox and the Hortonworks sandbox see my previous post (https://kzhendev.wordpress.com/2013/08/28/getting-started-with-the-hortonworks-sandbox/). This is for version 1.3 but I don’t think much should have changed in terms of configuration in version 2.0. If you have any more problems/questions then please let me know so I can help.

      Ken

      • workmama2319 says:

        Hi Ken, thanks for your response! The extra information helped me out a lot. The short answer that solved my issue was that I needed to use the fully qualified name “sandbox.hortonworks.com” instead of just “sandbox”. I am running my VM at work and since I am on the internal network our DNS server couldn’t find “sandbox”, but it could find “sandbox.hortonworks.com”.

        I modified my previous entry in my hosts file on my windows box (c:\windows\system32\drivers\etc\hosts) to the following and the files started to transfer as per your article:

        192.168.123.101 sandbox.hortonworks.com

        One last thing that we did to help figure out what was going on. On my sandbox VM, I ran the command “netstat -aN | grep 50070” to see what was listening on that port. From the results, I was able to see what has in LISTEN mode and looking at the name of the host I had the “ah-ha!” moment when I saw the fully qualified name!!

        Here is the sample of the output

        “tcp 0 0 sandbox.hortonworks.c:50070 *:* LISTEN”

        Thanks again for your help figuring this out!!

        Matt

  2. santhosh says:

    Hi Ken

    The HDFS destination is giving the validation error in SSIS data flow task

    Error:
    “Data Flow Task: System.IO.FileNotFoundException: Could not load file or assembly ‘Microsoft.Hadoop.WebClient, Version=0.9.4951.25594, Culture=neutral, PublicKeyToken=31bf3856ad364e35’ or one of its dependencies. The system cannot find the file specified. File name: ‘Microsoft.Hadoop.WebClient, Version=0.9.4951.25594, Culture=neutral, PublicKeyToken=31bf3856ad364e35’ at SSISHDFS.HDFSDestination.HDFSDestination.AcquireConnections(Object transaction)”

    I think it’s dependencies problem. Please provide the information for how to resole above error.

    • KenR says:

      Hi, Try installing Microsoft.Hadoop.WebClient into the GAC using gacutil.exe, you might find that you need to install a few dependencies. Please let me know which ones you had to install and I will update the post.

      • santhosh says:

        Hi ken

        Thanks for your response.
        I installed all dependency dll’s into GAC now above issue resolved, but when i execute the SSIS Data Flow Task i am getting another error below

        TITLE: Microsoft Visual Studio
        ——————————
        Exception deserializing the package “The package failed to load due to error 0xC0011008 “Error loading from XML. No further detailed error information can be specified for this problem because no Events object was passed where detailed error information can be stored.”. This occurs when CPackage::LoadFromXML fails.
        “.
        ADDITIONAL INFORMATION:
        The package failed to load due to error 0xC0011008 “Error loading from XML. No further detailed error information can be specified for this problem because no Events object was passed where detailed error information can be stored.”. This occurs when CPackage::LoadFromXML fails.

        Please could you provide the solution, how to resolve the above error.

        santhosh

      • santhosh says:

        Thanks for your response,

        As suggested I have created a new package and in a new SSIS solution, but still i am getting the same error.

  3. KenR says:

    Can you create a new package and try the component again? I remember encountering this problem a few times during development and think that fixed the problem…

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s