Getting Started with the Hortonworks Sandbox

In my previous post, I made reference to the Twitter Big Data example for Microsoft StreamInsight (project page).

The sample collects tweets in real time from Twitter, then does a few things:

  • Displays current trend information in real time on a web dashboard
  • Stores information about the tweets in a SQL Azure database
  • Stores the actual tweets in an Azure Blob Store

Then there are some commands you can use with Azure HDInsight to do some post-processing. This is great if you have access to the HDInsight Preview, but what if you are stuck on the waiting list? That’s where the Hortonworks Sandbox comes in!

In this post, I’m going to give an overview of setting up the sandbox, and then show how to move the data collected by StreamInsight from Azure into the Sandbox. In the next post I will show how to do the analysis.

Getting Started with the Sandbox

You will need to download:

  • Oracle VirtualBox
  • The Hortonworks Sandbox virtual appliance for VirtualBox

Once VirtualBox has been installed, ensure that there is a host-only network configured.

To do this, go to File -> Preferences then click “Network”. You should see an entry in the Host-only Networks:


Click on the screwdriver icon to edit the entry, and ensure that the adapter settings match the following:



Once the network is correctly configured, click OK until all the windows are closed.

The next step is to import the sandbox appliance. From the VirtualBox Manager, click File -> Import Appliance; this will bring up the import wizard.

On the first page click the Open appliance button, and browse to where you downloaded the sandbox. Click Next to reach the “Appliance settings” page; you shouldn’t have to change any details.

Click Import and a progress window will appear. Once the import has completed, you will see the Sandbox in the list of servers:


Configuring the Sandbox

The next step is to configure the network adapters.

Right click on the Hortonworks Sandbox, and click Settings. Then click Network to bring up the Network Configuration.

I configured my adapters like this:

(Screenshots: Adapter 1 and Adapter 2 settings.)

The Sandbox is now configured.

You can now boot the sandbox.
Right-click on the Hortonworks Sandbox in the VirtualBox Manager and click Start. This will boot up the Sandbox; after a minute or two you will see the following:


This is where I was caught out: from this window you can’t tell what IP address you should use to browse to the Sandbox. Because you configured the host-only network, you can browse to the sandbox on the IP address assigned to the host-only adapter.

You will be asked to fill in some details about yourself then you’re good to go.

Adding the SQL Server driver for Apache Sqoop

Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.

In the Twitter Big Data sample, SQL Azure is used to store information about the tweets that it collects. The Sandbox, when first set up, is missing the driver for SQL Server. To add it, follow these steps:
  1. Find the JDBC driver for SQL Server on Microsoft’s download site, currently sqljdbc_4.0.2206.100_enu.tar.gz
  2. Using PuTTY, SSH into the sandbox and log on as root with password hadoop
  3. Issue the following command to download the driver: wget
  4. Unzip using this command: gunzip sqljdbc_4.0.2206.100_enu.tar.gz
  5. Extract the tar: tar -xvf sqljdbc_4.0.2206.100_enu.tar
  6. Copy to the Sqoop lib directory: cp sqljdbc_4.0/enu/sqljdbc4.jar /usr/lib/sqoop/lib/
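If you want to try the unzip-and-extract steps before touching the sandbox, here is a sketch that runs anywhere by building a dummy archive first. The file and directory names mirror the real driver package, but the jar here is a placeholder, not the actual Microsoft driver:

```shell
# Build a dummy archive with the same layout as the driver package,
# so the extraction steps can be exercised without downloading anything.
mkdir -p sqljdbc_4.0/enu
echo "placeholder" > sqljdbc_4.0/enu/sqljdbc4.jar
tar -czf sqljdbc_demo.tar.gz sqljdbc_4.0

gunzip sqljdbc_demo.tar.gz        # step 4: leaves sqljdbc_demo.tar
tar -xvf sqljdbc_demo.tar         # step 5: extracts sqljdbc_4.0/enu/sqljdbc4.jar
# Step 6 on the sandbox would then be:
# cp sqljdbc_4.0/enu/sqljdbc4.jar /usr/lib/sqoop/lib/
```

Note that tar -xzvf sqljdbc_4.0.2206.100_enu.tar.gz would do the gunzip and the extraction in a single step.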

That covers all the configuration and setup of the sandbox.

Setting up the Twitter Big Data Solution

The Twitter Big Data solution can be downloaded from the CodePlex project site, by clicking on Source Code, then Download.


You will need to have StreamInsight installed and running on your development machine, installation instructions can be found here.

The Twitter Big Data solution comes with a README file which outlines various steps that you need to go through to configure the solution. You will need to fill in the app.config file with details of your SQL Azure database and Blob store. Once that is done you can run the solution and you should see data in the web dashboard, at the same time records will be going into your blob store and SQL Azure database.

Once there is some data in the SQL Azure database and the Blob Store, we can use various components of the Hortonworks Sandbox to pull data from Azure into Hadoop.

The remainder of this post is basically going to be a re-write of the instructions found in the README file of the solution.

Moving data from SQL Azure to Hadoop using Sqoop

Sqoop is a component in the sandbox which connects to different data sources and moves their data into HDFS (the Hadoop Distributed File System).

I wasn’t able to get Sqoop working with SQL Azure via the nice UI provided by the sandbox, so I used SSH instead. You will need PuTTY or another SSH client to complete this stage.

  • Connect to the sandbox via SSH and log on as user root with password hadoop
  • Execute the following command:

sqoop import --connect "jdbc:sqlserver://your_sql_azure_server_name_here.database.windows.net;username=your_sql_azure_username_here@your_sql_azure_server_name_here;password=your_sql_azure_password_here;database=your_db_name_here" --table TweetInfo --hive-import --hive-overwrite

Sqoop will now start moving data from your SQL Azure database into the sandbox. This shouldn’t take more than a few minutes, depending on how much data is in the SQL Azure database and on your connection speed.

Moving data from Azure Blob Storage into HDFS

The nice thing about HDInsight in Azure is that it can be configured to use Azure Storage Vault, which basically means you can map the file system of your Hadoop cluster to Azure blob storage. Unfortunately, there is no straightforward way that I could find (if I’m wrong please let me know!) to connect the sandbox to Azure blob storage.

In the README file, the author has you map Hadoop on Azure (HDInsight) to Azure Storage Vault (ASV). Because I couldn’t figure out a good way to do this, what I ended up doing was a bit clunky.

I used the Azure Storage Explorer to download all the blobs from the container I specified in app.config into a directory. I then zipped all of these files and uploaded the zip file into the Sandbox.

Uploading the zip file to the Sandbox is very easy. With the Sandbox open in your web browser, click the File Browser icon:


Then click Upload -> Zip File:


From here select the zip file containing all your tweets. This will upload the file, then extract all the files into a directory with the same name as the zip file.

Processing the Tweets using Hive

There is a script included in the solution called “Analyze Tweets with Hive.txt”. I have made some small modifications to it, which you can download here. If you compare this file to the original you will notice that the first two configuration lines have been removed as well as some of the lines which were commented out. Also, the location for the raw_tweets table has been updated.

To process this script using the Sandbox, in your browser:

  • Click on the Beeswax icon, this is the Hive UI
  • In the Settings section, add the following settings:
    • key: hive.exec.dynamic.partition value: true
    • key: hive.exec.dynamic.partition.mode value: nonstrict
  • Open the modified script, and copy and paste the contents into the query editor
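The two settings are just standard Hive configuration properties; the configuration lines removed from the original script most likely set them directly, so equivalently you could paste this fragment at the top of the script instead of using the Settings section:

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
```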

Click Execute to start the script.

This will then start processing the raw tweets that were stored in the Azure blob storage, which we moved into HDFS.

How long this takes to execute depends on how many tweets were collected. Unfortunately you can’t leave it unattended, as you will see the following come up:


You will need to click Next to continue processing.

Installing the Hive ODBC Driver

To analyse the tweets using Excel you will need to install the ODBC Driver from Hortonworks, which can be found here.

You will need to make sure you install the version that corresponds to the version of Excel you have installed (32bit / 64bit).

After you have installed the driver, open Administrative Tools, then open ODBC Data Sources (32-bit or 64-bit, matching the driver you installed).

Click the System DSN tab, and you should see:


Click on Configure, here you will need to fill in the host and user name fields:


At this point you have all the tweet information generated by the StreamInsight application in the Sandbox, and the ODBC driver installed and configured.

In the next post I will show how to analyse this information.

Real-Time Analytics for Twilio

Recently I have been looking into complex event processing, what the use cases for it are and the kind of insights you can get from it.

As I am a .NET developer, I thought it best to start with the Microsoft stack, so I have been experimenting with StreamInsight. There are a few examples floating around, but the one that really inspired me was the Big Data Twitter Demo (link).

After seeing the Twitter demo I jumped into gear and built a real-time dashboard for Twilio.

For those who are not familiar with Twilio, they are a cloud-based telephony company that sells telephone numbers powered by REST APIs. This allows developers to build complex and dynamic telephony applications (more details).

So what did I build, and why did I bother?

You can see from the screenshot below that it is a very simple real-time dashboard.


The graph shows:

  • In blue – total number of calls that started in a 5 second window
  • In black – total number of calls that ended in a 5 second window

All of this data is produced by making a telephone call to my Twilio phone number.

(If you add up the numbers you will see that the starts and ends aren’t equal; this is because, at the time of the screenshot, a number of calls were still in progress.)

The system is not complicated, but there are a number of different parts to it:

  • ASP.NET WebAPI – This is the endpoint that Twilio fires requests to when it receives a phone call on my telephone number
  • RabbitMQ – When a request is made from Twilio to the WebAPI, the WebAPI sends a “start” or “end” message to RabbitMQ
  • StreamInsight application – This is the glue between the events received and the output sent to the client. It listens to the RabbitMQ queue and does the fancy “complex event processing” on the events received
  • ASP.NET MVC Website – Uses SignalR (a real-time framework for communication between a web client and server) to update the graph as events are processed

Looking at this stack you’re probably thinking,

“that is way too complicated to get a simple graph on a webpage!”

And you’d be right. If all I wanted was a simple graph, I could just have my WebAPI broadcast messages directly to SignalR and cut out everything in between.

What StreamInsight Adds

If all I wanted was a graph showing the number of calls starting in a given window, I could probably build up the logic to do that with a bit of effort myself. However, what StreamInsight is designed for is complex event processing.

So to collect the number of events in a 5 second window all I have to do is write this query:

var callsByType = from call in rabbitMQInput
                  group call by call.EventType into groups
                  from win in groups.TumblingWindow(TimeSpan.FromSeconds(5))
                  select new CallSummary
                  {
                      TotalCalls = win.Count(),
                      EventType = groups.Key
                  };

To get it working end to end requires a few more components, but what I’m trying to show here is how simple it is to write queries. In later posts I’m going to explain how to write a StreamInsight application.

The example above is very trivial, so imagine that rather than a single Twilio number you are building a system for a large telecoms company with hundreds or thousands of phone numbers, geographically dispersed, each tied to a different department, and you want to see which departments are receiving the largest volume of calls.

Or, take another more complicated example: you want to correlate diagnostic events against the number of failed calls since an outage.

It would be very difficult (nigh on impossible) to write this kind of logic inside a Web API controller method.

With StreamInsight it is very easy: you can do all kinds of aggregations on events, and you can also join multiple event streams together. This is where the power of StreamInsight and complex event processing lies.

Now I hope you’re beginning to see the necessity of all the pieces of the stack I mentioned above.

The advantage of using RabbitMQ as the input into StreamInsight is that you could have events from other systems (e.g. diagnostics) all flowing into one place, where StreamInsight can process them to find patterns and alert engineers in real time.