# Measuring Application Performance with Mini Profiler and Splunk

At DrDoctor we use New Relic for performance monitoring of our web applications. However, the web only makes up a small part of our overall system. We make extensive use of messaging and have lots of micro-services which communicate over a message bus.

After adding New Relic to our Windows services we soon realized that it doesn’t really do a good job of supporting them. The main reason, as far as I can tell from reading their docs, is that New Relic depends on an active HttpContext for profiling applications.

Having used New Relic to pinpoint the performance problems in our web app, I started to think about what it would take to build something similar for a Windows service. This blog post shows off a proof-of-concept that I’m currently developing.

### Introducing Mini Profiler

Mini Profiler is a lightweight C# library written by the folks at StackExchange. It was designed to show performance metrics for web applications, but is flexible enough that I was able to plumb it into our Windows services. Doing so is pretty easy:

```csharp
MiniProfiler.Settings.ProfilerProvider = new StackExchange.Profiling.SingletonProfilerProvider(); // note: not thread safe
MiniProfiler.Settings.Storage = new Log4NetStorage(container.Resolve<ILogger>()); // the Castle Windsor wrapper interface
MiniProfiler.Settings.SqlFormatter = new SimpleSqlFormatter();
```


The next step was to create my own custom storage mechanism to output the results of a profiling session to a log file. This was not entirely trivial, as a profiler session is a hierarchy of steps. Here is how I implemented the Save method of IStorage:

```csharp
public void Save(MiniProfiler profiler)
{
    var timings = new Stack<Timing>();

    if (profiler.Root.HasChildren)
    {
        var children = profiler.Root.Children;
        for (var i = children.Count - 1; i >= 0; i--) timings.Push(children[i]);
    }
    else
    {
        timings.Push(profiler.Root);
    }

    var intro = string.Format(@"{{""Timestamp"":""{0}"",""Application"":""{1}"",""Status"":""Starting"",""TransactionId"":""{2}"",""Transaction"":""{3}""}}",
        profiler.Started.ToString(DATE_FORMAT),
        applicationName,
        profiler.Root.Id,
        profiler.Root.Name);

    logger.Info(intro);

    decimal runningDuration = 0;

    while (timings.Count > 0)
    {
        var timing = timings.Pop();
        var name = timing.Name;

        decimal customTimingDuration = 0;

        if (timing.HasCustomTimings)
        {
            foreach (var item in timing.CustomTimings)
            {
                customTimingDuration += item.Value.Sum(v => v.DurationMilliseconds ?? 0);

                decimal customOffset = runningDuration;

                foreach (var customTiming in item.Value)
                {
                    customOffset += customTiming.DurationMilliseconds ?? 0;

                    // timestamp is approximated from the running offset into the session
                    logger.Info(string.Format(@"{{""Timestamp"":""{0}"",""Application"":""{1}"",""Status"":""Processing"",""TransactionId"":""{2}"",""Transaction"":""{3}"",""Action"":""{4}"",""CommandString"":""{5}"",""Duration"":""{6}""}}",
                        profiler.Started.AddMilliseconds((double)customOffset).ToString(DATE_FORMAT),
                        applicationName,
                        profiler.Root.Id,
                        profiler.Root.Name,
                        string.Format("{0}/{1}", name, item.Key),
                        customTiming.CommandString,
                        customTiming.DurationMilliseconds));
                }
            }
        }

        // time spent in this step itself, excluding children and custom timings
        var duration = timing.DurationWithoutChildrenMilliseconds - customTimingDuration;
        var msg = string.Format(@"{{""Timestamp"":""{0}"",""Application"":""{1}"",""Status"":""Processing"",""TransactionId"":""{2}"",""Transaction"":""{3}"",""Action"":""{4}"",""Duration"":""{5}""}}",
            profiler.Started.AddMilliseconds((double)runningDuration).ToString(DATE_FORMAT),
            applicationName,
            profiler.Root.Id,
            profiler.Root.Name,
            name,
            duration);

        logger.Info(msg);

        if (timing.HasChildren)
        {
            var children = timing.Children;
            for (var i = children.Count - 1; i >= 0; i--) timings.Push(children[i]);
        }

        runningDuration += timing.DurationWithoutChildrenMilliseconds + customTimingDuration;
    }

    var end = string.Format(@"{{""Timestamp"":""{0}"",""Application"":""{1}"",""Status"":""Finished"",""TransactionId"":""{2}"",""Transaction"":""{3}""}}",
        profiler.Started.AddMilliseconds((double)profiler.DurationMilliseconds).ToString(DATE_FORMAT),
        applicationName,
        profiler.Root.Id,
        profiler.Root.Name);

    logger.Info(end);
}
```


This isn’t particularly pretty, and I’ll be the first to admit that using string.Format to create JSON is really bad practice, but I wanted something that would be lightweight and work with log4net, which is our current logging library. Really what I should do is adopt semantic logging and use something like Serilog. But I’ll save that for another day.
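For comparison, here is a sketch of what one of these events could look like with Serilog's message templates. This is illustrative only, not our actual code: the property names mirror the JSON fields above, and the file sink is just one option.

```csharp
using Serilog;

Log.Logger = new LoggerConfiguration()
    .WriteTo.File("perf.log") // swap in a JSON formatter or another sink as needed
    .CreateLogger();

// one structured event instead of hand-built JSON; the named placeholders
// become real fields in any structured sink
Log.Information(
    "Transaction {TransactionId} ({Transaction}) {Status}: {Action} took {Duration}ms",
    Guid.NewGuid(), "Processor/MessageA", "Processing", "Dispatcher", 669.8);
```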

The next step was to create my own ISqlFormatter, as the ones that come with Mini Profiler include a lot more detail than I wanted to record. This one was very trivial:

```csharp
public class SimpleSqlFormatter : ISqlFormatter
{
    public string FormatSql(string commandText, List<SqlTimingParameter> parameters)
    {
        return commandText;
    }
}
```


With these two classes written, plumbing it into my application was easy, and I soon had performance results being output as JSON into a log file, which I could then import into Splunk.

With the plumbing done, it’s time to actually instrument our code. With Mini Profiler you call MiniProfiler.Start() at the beginning of a session and .Stop() at the end. In between you can call .Step() anywhere, and this creates a kind of sub-profiler, for example if you want to explicitly measure the performance of a particular part of your code.

When using Mini Profiler you will want to break your code into individual sessions; on a website, for example, a session would be a single web request. The idea is that you want to be able to measure and compare performance over time.

As mentioned above, the services that I want to add this to all communicate over a messaging bus. So the logical place to call .Start() and .Stop() is in the class that is invoked when a new message arrives on the bus. I also wrapped our database connection code with the Mini Profiler instrumentation code so that it can measure time spent executing SQL and what that SQL was.
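As a rough sketch, the message handler ends up shaped something like this. The names here are illustrative rather than our actual code, and the exact Start/Stop overloads depend on your Mini Profiler version:

```csharp
public void Handle(MessageA message)
{
    // one profiling session per message = one "transaction" in the logs
    var profiler = MiniProfiler.Start("Processor/MessageA");
    try
    {
        using (profiler.Step("Dispatcher"))
        {
            // work happens here; the profiled DbConnection wrapper records
            // any SQL executed as custom timings under this step
            dispatcher.Dispatch(message);
        }
    }
    finally
    {
        MiniProfiler.Stop(); // fires IStorage.Save, which writes the JSON events
    }
}
```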

### Logger Output

After wiring everything up and running through a few test cases, this is a sample of the events that were produced:

```
{"Timestamp":"05/12/15 13:41:50.22 Z","Application":"ApplicationA","Status":"Starting","TransactionId":"2c9d0f76-cbca-4d4f-b736-48d09a3b75c1","Transaction":"Processor/MessageA"}
{"Timestamp":"05/12/15 13:41:50.24 Z","Application":"ApplicationA","Status":"Processing","TransactionId":"2c9d0f76-cbca-4d4f-b736-48d09a3b75c1","Transaction":"Processor/MessageA","Action":"Dispatcher/sql","CommandString":"select ######### from ##### WHERE (###### = @p1 AND ##### = @p2)","Duration":"18.9"}
{"Timestamp":"05/12/15 13:41:50.22 Z","Application":"ApplicationA","Status":"Processing","TransactionId":"2c9d0f76-cbca-4d4f-b736-48d09a3b75c1","Transaction":"Processor/MessageA","Action":"Dispatcher","Duration":"669.8"}
{"Timestamp":"05/12/15 13:41:50.93 Z","Application":"ApplicationA","Status":"Processing","TransactionId":"2c9d0f76-cbca-4d4f-b736-48d09a3b75c1","Transaction":"Processor/MessageA","Action":"EasyNetQ/Publish/MessageB","Duration":"34"}
{"Timestamp":"05/12/15 13:41:50.96 Z","Application":"ApplicationA","Status":"Processing","TransactionId":"2c9d0f76-cbca-4d4f-b736-48d09a3b75c1","Transaction":"Processor/MessageA","Action":"EasyNetQ/Publish/MessageC","Duration":"4.3"}
{"Timestamp":"05/12/15 13:41:50.96 Z","Application":"ApplicationA","Status":"Processing","TransactionId":"2c9d0f76-cbca-4d4f-b736-48d09a3b75c1","Transaction":"Processor/MessageA","Action":"EasyNetQ/Publish/MessageD","Duration":"5.2"}
{"Timestamp":"05/12/15 13:41:50.97 Z","Application":"ApplicationA","Status":"Finished","TransactionId":"2c9d0f76-cbca-4d4f-b736-48d09a3b75c1","Transaction":"Processor/MessageA"}
```


### Building Splunk Dashboards

Since the data is output as JSON it is really easy to import into Splunk, so I’m not going to cover that here. With the data in Splunk, it’s time to turn the raw events into some useful charts.

The first one is a stacked area chart, which shows the total time spent for a transaction over time. High values on this chart aren’t necessarily indicative of performance problems, as it could just be that we were experiencing high throughput of a certain transaction.

A variation on the chart above would be to show the average duration instead of the total time, that would be more indicative of performance problems.
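As a sketch, these first two charts can be driven by searches along these lines. The sourcetype name `app_perf` is an assumption (use whatever you configured on import); Duration and Transaction are the fields from the JSON events above:

```
sourcetype=app_perf Status=Processing
| timechart span=30m sum(Duration) by Transaction

sourcetype=app_perf Status=Processing
| timechart span=30m avg(Duration) by Transaction
```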

The next chart shows the total time spent on processing for each transaction in the selected time period. Once again this isn’t necessarily indicative of a performance problem, but it does show where our system spends most of its time.

The next chart shows the average duration of a transaction; viewed in the context of the previous chart, this could start to highlight potential problem areas. A variation would be to stack it with the number of times the transaction was executed.

The next chart shows the maximum time it took to execute a SQL statement. This is very high level but still useful for spotting SQL queries that might need tuning. Averages would probably be a better value to show here.

The last chart is a stacked area chart of time spent on the different actions that were measured, over time. This is more useful when looked at in the context of a single transaction type.

As an added bonus, Splunk lets you click on any of the charts and drill through to the raw events.

And here is what the dashboard looks like with everything in place, note also the filters at the top of the page, so I can drill into the different applications and individual transactions – they also cross filter 🙂

This is still very much a proof-of-concept; I’m hoping to get it running in production in the next couple of weeks, and when it is I’ll post a follow-up.

# Consuming JSON with Splunk in two simple steps

Last week I needed to configure Splunk to consume JSON log files. The documentation on the Splunk website wasn’t particularly clear, and my first attempt ended in some strange results, with data being repeated. An old colleague of mine (thanks Matt) pointed me in the direction of this Splunk Answers question, which described both the problem I was having and the solution: fixing the configuration.
So here are the steps required to set up Splunk to consume JSON from a log file. I’ll assume that you already have an instance of Splunk installed.

### Step 1 – Install the Universal Forwarder (optional)

The setup I was working with was a Splunk server running on a virtual machine in Azure and an on-premise server producing the log files to consume. Splunk provides a useful utility called the Universal Forwarder that consumes event data and sends it on to the Splunk server.
Installation is really straightforward so I’m not going to cover it here.

### Step 2 – Configuring a custom source type

This is the part that caught me out. From the searching I did the first time around, I learnt that I needed to set up a custom source type telling Splunk to parse the data as JSON. The mistake I made was creating this custom source type on the remote node where the Forwarder was installed.
To do it correctly, you will need to open/create a props.conf file on the Splunk server with the following content:

```
[my_custom_type]
INDEXED_EXTRACTIONS = json
KV_MODE = none
```

The props.conf file can be found at $SPLUNK_HOME/etc/system/local/. If props.conf doesn’t exist in this folder (it didn’t for me) then you will need to create it.

### Step 3 – Setting up log file monitoring

This is the easy part, and the part that I did do correctly. On the remote node, open the inputs.conf file and add the following:

```
[monitor://c:\logs\directory\]
sourcetype = my_custom_type
```

The inputs.conf file can be found at $SPLUNK_HOME/etc/system/local.
With that done, data is going in and nothing is being duplicated.
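Once events are flowing, a quick sanity check in the Splunk search bar confirms the JSON fields were extracted. The field names here are from my own profiling events; substitute your own:

```
sourcetype=my_custom_type
| head 5
| table _time, Application, Status, Transaction, Duration
```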

# Apache Flume – Get logs out of RabbitMQ and into HDFS

This post is an extension of Tutorial 12 from Hortonworks (original here), which shows how to use Apache Flume to consume entries from a log file and put them into HDFS.

In this post I’m going to further extend the original tutorial to show how to use Apache Flume to read log entries from a RabbitMQ queue.

Apache Flume is described by the folk at Hortonworks as:

> Apache™ Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows; and is robust and fault tolerant with tunable reliability mechanisms for failover and recovery.

## Overview

• Installation and Configuration of Flume
• Generating fake server logs into RabbitMQ

To follow along you will need to:

## Installing Flume

With the Sandbox up and running, press Alt and F5 to bring up the login screen. You can log in using the default credentials:

```
login: root
password: hadoop
```

After you’ve logged in type:

```
yum install -y flume
```

You should now see the installation progressing until it says Complete!

For more details on installation take a look at Tutorial 12 from Hortonworks.

Using the flume.conf file that is part of my tutorial files, follow the instructions in the tutorial to upload it into the sandbox. Before uploading the file, check that the RabbitMQ configuration matches your system:

```
sandbox.sources.rabbitmq_source1.hostname = 192.168.56.65
sandbox.sources.rabbitmq_source1.queuename = logs
sandbox.sources.rabbitmq_source1.port = 5672
sandbox.sources.rabbitmq_source1.virtualhost = logs
```

You shouldn’t need to change anything else.
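For context, the rest of flume.conf wires that source through a channel into an HDFS sink, roughly along these lines. The channel/sink names are illustrative and the source's type line comes from the RabbitMQ plugin jar installed below; use the flume.conf from the tutorial files as the source of truth:

```
sandbox.sources = rabbitmq_source1
sandbox.channels = memory_channel1
sandbox.sinks = sink_to_hdfs

# rabbitmq_source1.type is the source class provided by the plugin jar
sandbox.sources.rabbitmq_source1.channels = memory_channel1

sandbox.channels.memory_channel1.type = memory

sandbox.sinks.sink_to_hdfs.type = hdfs
sandbox.sinks.sink_to_hdfs.channel = memory_channel1
sandbox.sinks.sink_to_hdfs.hdfs.path = /flume/rab_events
sandbox.sinks.sink_to_hdfs.hdfs.fileType = DataStream
```

Note that the hdfs.path matches the LOCATION of the HCatalog table created later in this post.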

For Flume to be able to consume from a RabbitMQ queue, I created a new plugins directory and then uploaded the Flume-ng RabbitMQ library into it.

Creating the required directories can be done from the Sandbox console with the following command:

```
mkdir /usr/lib/flume/plugins.d
mkdir /usr/lib/flume/plugins.d/flume-rabbitmq
mkdir /usr/lib/flume/plugins.d/flume-rabbitmq/lib
```

Once these directories have been created, upload the flume-rabbitmq-channel-1.0-SNAPSHOT.jar file into the lib directory.

## Starting Flume

From the Sandbox console, execute the following command:

```
flume-ng agent -c /etc/flume/conf -f /etc/flume/conf/flume.conf -n sandbox
```

## Generate server logs into RabbitMQ

To generate log entries I took the original Python script (which appended entries to the end of a log file) and modified it to publish the entries to RabbitMQ instead.

To run the Python script you will need to install the pika client library (see the instructions on the RabbitMQ website).

The script is set up to connect to a broker on localhost, into a virtual host called “logs”. You will need to make sure that this virtual host exists.

You can start the script by running:

```
python.exe c:\path\to\generate_logs.py
```

When this is started the script will declare an exchange and queue and then start publishing log entries.
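The modified script is roughly shaped like this sketch. The exchange/queue names and the fanout type are my assumptions for illustration, as is the exact field content; the pipe-delimited time|ip|country|status layout matches the HCatalog table created later in this post:

```python
# Sketch of the modified generate_logs.py: publish fake firewall-style
# log entries to RabbitMQ instead of appending them to a file.
import random
import time
from datetime import datetime

COUNTRIES = ["US", "GB", "DE", "FR", "NZ"]
STATUSES = ["200", "404", "500"]

def make_log_entry():
    """Build one pipe-delimited entry: time|ip|country|status."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    ip = ".".join(str(random.randint(1, 254)) for _ in range(4))
    return "|".join([stamp, ip, random.choice(COUNTRIES), random.choice(STATUSES)])

def main():
    import pika  # RabbitMQ client library; install per the RabbitMQ site

    conn = pika.BlockingConnection(
        pika.ConnectionParameters(host="localhost", virtual_host="logs"))
    channel = conn.channel()
    channel.exchange_declare(exchange="logs", exchange_type="fanout")
    channel.queue_declare(queue="logs")
    channel.queue_bind(queue="logs", exchange="logs")

    while True:  # publish a fake entry every half second
        channel.basic_publish(exchange="logs", routing_key="",
                              body=make_log_entry())
        time.sleep(0.5)

if __name__ == "__main__":
    main()
```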

You can see that everything is running by going over to the RabbitMQ Management console.

The incoming messages are the ones being generated by the Python script and the deliver / get and ack messages are the ones being consumed by Flume.

## Setting up HCatalog

The following command (from the original tutorial) can be used to create the HCatalog table (make sure you enter it on a single line):

```
hcat -e "CREATE TABLE firewall_logs (time STRING, ip STRING, country STRING, status STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/flume/rab_events';"
```

You should now be able to browse this table from the web interface.

To do some analysis on this data you can now follow steps 5 and 6 from the original tutorial.

# SQL Server Integration Services – An Updated HDFS Task

This is an update to my previous post about loading data into HDFS. After using the component a few times, I realized that having to pass a list of files to upload into it seemed a bit odd inside a Data Flow task.

So I have changed the component to be an SSIS Task rather than a Destination. This means it is used in the Control Flow of a package instead of in a Data Flow task.

I have also made a few other changes:

• Added a file type filter
• Improved the UI design
• Added the ability to create a new HDFS Connection Manager
• Added a UI to the HDFS Connection Manager

This is what the component now looks like:

The File Type Filter allows you to specify what types of files should be uploaded from the source directory specified. This is useful if you have a mixture of files or only want to upload a subset.

The update has been pushed to GitHub – https://github.com/kzhen/SSISHDFS.

# Load Files into HDFS using SQL Server Integration Services

### UPDATE: I’ve made a few changes to how the component works –

Recently I have been playing around with the Hortonworks Big Data sandbox, the tutorials were very good, and made it easy to get going with some sample data.

Given that I mainly use the Microsoft BI stack, I was hoping to use SSIS to load my data, especially as it would be nice down the line to do this as part of an ETL process. However, Microsoft doesn’t yet have first-class support for HDFS as a destination.

So to fix this problem I decided to build a custom destination for SSIS that would load data into HDFS.

My goal was to be able to transfer a bunch of files from my filesystem into HDFS.

This is my directory of files

The way my custom destination currently works is that you send a list of filenames into it, and those files are then transferred into HDFS. Here is my source file with a list of files to upload.

Next up is the SSIS package. As you can see below, the package is very simple. There is a flat file source, which I’ve setup to read from the source file above, it then feeds into the HDFS Destination.

The HDFS Destination has three parts to configure:

• Configuration Manager
• Destination Path – this will be the path in HDFS to upload the files to (I had to change the permissions so all groups had write access)
• Source column – this is the column from the flat file source to read the filenames from

The HDFS Configuration Manager is set up with a hostname and port; I’ve set mine to “sandbox” and 50070, where sandbox is an entry in my hosts file pointing to my Hortonworks sandbox (192.168.56.101).

After all this has been configured the SSIS package can be run.

Now, to check HDFS

All the files have been uploaded into HDFS.

The custom component uses the Microsoft Hadoop Webclient, which is part of the Microsoft Hadoop .Net SDK. The code is available on GitHub.

That wraps up the first incarnation of my HDFS Destination. I’m planning to further extend it, so that rather than sending in a list of filenames to upload you will be able to send in string data which will then be transferred as files into HDFS.

# Getting Started with the Hortonworks Sandbox

In my previous post, I made reference to the Twitter Big Data example for Microsoft StreamInsight (project page).

The sample collects tweets in real-time from Twitter then does a few things:

• Displays current information about trends in real-time on a web dashboard
• Stores information about the tweets into a SQL Azure database
• Store the actual tweets into an Azure Blob Store

Then there are some commands that you can use with Azure HDInsight to do some post-processing, this is great if you have access to the HDInsight Preview, but what if you are stuck on the waiting list? That’s where the Hortonworks Sandbox comes in!

In this post, I’m going to give an overview of getting the sandbox setup and then how to move the data collected by StreamInsight from Azure into the Sandbox. In the next post I will be showing how to do the analysis.

## Getting Started with the Sandbox

Once VirtualBox has been installed, ensure that there is a host-only network configured.

To do this, go to File -> Preferences then click “Network”. You should see an entry in the Host-only Networks:

Click on the screw-driver icon to edit the entry, and ensure that the adapter settings match the following:

Once the network is correctly configured, click OK until all the windows are closed.

The next step is to import the sandbox appliance. From the VirtualBox Manager, click File -> Import Appliance; this will bring up the import wizard.

On the first page click the Open appliance button, and browse to where you downloaded the sandbox. Click Next and you will see the “Appliance settings” page, you shouldn’t have to change any details.

Click Import, you will see a progress window appear. Once it has completed, you will see it in the list of servers:

### Configuring the Sandbox

The next step is to configure the network adapters.

Right click on the Hortonworks Sandbox, and click Settings. Then click Network to bring up the Network Configuration.

I configured my adapters like this:

The Sandbox is now configured.

You can now boot the sandbox.
Right click in the VirtualBox Manager and click Start. This will boot up the Sandbox, after a minute or two you will see the following:

This is where I was caught out: from looking at this window you don’t know what IP address to use to browse to the Sandbox. Because the host-only network is configured, you can browse to the sandbox on 192.168.56.101.

You will be asked to fill in some details about yourself then you’re good to go.

### Adding the SQL Server driver for Apache Sqoop

Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.

In the Twitter Big Data sample, SQL Azure is used to store information about the Tweets that it collects. The Sandbox when first setup is missing the driver for SQL Server, to add it follow these steps:
1. Find the JDBC drivers for MS SQL on Microsoft’s site (currently sqljdbc_4.0.2206.100_enu.tar.gz)
2. Using Putty, SSH onto the sandbox and log on as root with password hadoop
3. Unzip using this command: gunzip sqljdbc_4.0.2206.100_enu.tar.gz
4. Extract the tar: tar -xvf sqljdbc_4.0.2206.100_enu.tar
5. Copy to the Sqoop lib directory: cp sqljdbc_4.0/enu/sqljdbc4.jar /usr/lib/sqoop/lib/

This now covers off all the configuration and setup of the sandbox.

## Setting up the Twitter Big Data Solution

You will need to have StreamInsight installed and running on your development machine, installation instructions can be found here.

The Twitter Big Data solution comes with a README file which outlines various steps that you need to go through to configure the solution. You will need to fill in the app.config file with details of your SQL Azure database and Blob store. Once that is done you can run the solution and you should see data in the web dashboard, at the same time records will be going into your blob store and SQL Azure database.

Once there is some data in the SQL Azure database and the Blob Store, we can use various components of the Hortonworks Sandbox to pull data from Azure into Hadoop.

The remainder of this post is basically going to be a re-write of the instructions found in the README file of the solution.

## Moving data from SQL Azure to Hadoop using Sqoop

Sqoop is a component in the sandbox which connects to different data sources and moves it into HDFS (hadoop distributed file system).

I wasn’t able to get Sqoop working with SQL Azure via the nice UI provided by the sandbox, so I used SSH instead. You will need Putty or another SSH client to complete this stage.

• Connect to the sandbox via SSH, logon using user: root, password: hadoop
• Execute the following command:

Sqoop will now start moving data from your SQL Azure database into the sandbox. This shouldn’t take more than a few minutes, depending on how much data is in the database and your connection speed.

## Moving data from Azure Blob Storage into HDFS

The nice thing about HDInsight in Azure is that it can be configured to use Azure Storage Vault, which basically means you can map the file system of your hadoop system to Azure blob storage. Unfortunately, there is no straightforward way that I could find (if I’m wrong please let me know!) to connect the sandbox to Azure blob storage.

In the README file, the author has you map Hadoop on Azure (HDInsight) to Azure Storage Vault (ASV). Because I couldn’t figure out a good way to do this, what I ended up doing was a bit clunky.

I used the Azure Storage Explorer to download all the blobs from the container I specified in the app.config into a local directory. I then zipped up all of these files and uploaded the zip file into the Sandbox.

Then click Upload -> Zip File:

From here select the zip file with all your tweets. This will upload the file, then extract all the files into a directory of the same name as the zip file.

## Processing the Tweets using Hive

There is a script included in the solution called “Analyze Tweets with Hive.txt”. I have made some small modifications to it, which you can download here. If you compare this file to the original you will notice that the first two configuration lines have been removed as well as some of the lines which were commented out. Also, the location for the raw_tweets table has been updated.

To run this script in the Sandbox, in your browser:

• Click on the Beeswax icon, this is the Hive UI
• In the Settings section, add the following settings:
• key: hive.exec.dynamic.partition value: true
• key: hive.exec.dynamic.partition.mode value: nonstrict
• Open the modified script, and copy and paste the contents into the query editor

Click Execute to start the script

This will then start processing the raw tweets that were stored in the Azure blob storage, which we moved into HDFS.

This will take longer to execute the more tweets were collected. Unfortunately you can’t leave it unattended, as you will see the following come up:

You will need to click Next to continue processing.

## Installing the Hive ODBC Driver

To analyse the tweets using Excel you will need to install the ODBC Driver from Hortonworks, which can be found here.

You will need to make sure you install the version that corresponds to the version of Excel you have installed (32bit / 64bit).

After you have installed the driver, open Administrative Tools. Then open ODBC Data Sources 32 bit / 64 bit.

Click the System DSN tab, and you should see:

Click on Configure, here you will need to fill in the host and user name fields:

At this point you have all the tweet information generated by the StreamInsight application loaded into the Sandbox, and the ODBC Driver installed and configured.

In the next post I will show how to analyse this information.