Visualizing Workflows with Splunk and D3

At DrDoctor we do a lot with SMS: we let patients reschedule appointments, book appointments when they move to the top of a waiting list, and move their appointments forward by sending them lists of slots as capacity becomes available.

In all of this we use workflows to represent what is going on between the patient’s mobile phone and our system. We call them workflows but really they are just state machines built on top of the excellent .Net library Automatonymous.
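
To give a flavour of what these look like, here is a minimal Automatonymous-style sketch. The states and events are made up for illustration and are not our production workflow; older versions of the library also require explicit State()/Event() declarations in the constructor.

using System;
using Automatonymous;

// Illustrative only: a stripped-down state machine in the style we use
public class BookingInstance
{
    public Guid InstanceId { get; set; }
    public State CurrentState { get; set; }
}

public class BookingStateMachine : AutomatonymousStateMachine<BookingInstance>
{
    public BookingStateMachine()
    {
        // Tell Automatonymous where the current state lives on the instance
        InstanceState(x => x.CurrentState);

        Initially(
            When(SlotsOffered).TransitionTo(WaitingForSlotSelection));

        During(WaitingForSlotSelection,
            When(SlotSelected).TransitionTo(WaitingForBookingTeam));

        During(WaitingForBookingTeam,
            When(BookingConfirmed).Finalize());
    }

    // Recent versions discover these State/Event properties automatically
    public State WaitingForSlotSelection { get; private set; }
    public State WaitingForBookingTeam { get; private set; }

    public Event SlotsOffered { get; private set; }
    public Event SlotSelected { get; private set; }
    public Event BookingConfirmed { get; private set; }
}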

Like all startups we want to ensure that we are always delivering an excellent experience to all our users, which means keeping an eye out for those who get stuck and changing the workflows accordingly.

Splunking State Changes

We’ve been using Splunk as part of our reporting system for about 4 months now. We put lots of different events into it, including a huge variety of events which are generated by our workflows. Whenever an instance of a workflow changes state we publish an event similar to this:


{
  "EventId": "ee9b4347-fa3c-4b9f-bcc8-396550dd389c",
  "Category": "WorkflowStateChanged",
  "Payload": {
    "CurrentState": "WaitingForBookingTeam",
    "InstanceId": "7f205b67-bc9b-4b63-a176-b1d167ca7e29",
    "InstanceType": "PartialBookingInstance",
    "PreviousState": "WaitingForSlotSelection"
  },
  "Timestamp": "2015-07-25T11:16:58.0722714Z"
}
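
In code that event is just a couple of simple classes; something along these lines (illustrative names, not our exact types):

using System;

// Illustrative classes matching the JSON above
public class WorkflowStateChangedEvent
{
    public Guid EventId { get; set; }
    public string Category { get; set; } // e.g. "WorkflowStateChanged"
    public WorkflowStateChangedPayload Payload { get; set; }
    public DateTime Timestamp { get; set; }
}

public class WorkflowStateChangedPayload
{
    public string CurrentState { get; set; }
    public Guid InstanceId { get; set; }
    public string InstanceType { get; set; }
    public string PreviousState { get; set; }
}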

Listing all the different state transitions can be done with the following Splunk search:


Category=WorkflowStateChanged | stats count by Payload.PreviousState Payload.CurrentState

This produces the following table of events:

1-table-of-states

Splunk and D3

That table of data isn’t very useful to look at by eye; it takes a fair bit of work to trace the path through a given workflow. That’s where D3 comes into the picture. D3 is a JavaScript library for producing interactive charts. It’s incredibly powerful, but with that power comes a bit of a learning curve.

D3 charts aren’t built directly into Splunk, however if you’re happy to learn a bit of JavaScript and a small bit of Python’s Django framework then you can do some pretty cool stuff with the Splunk Web Framework.

To produce the D3 charts shown below I followed the instructions from the Web Framework getting started tutorial, but then used a modified version of the Sankey chart based on this example by Colin Fergus. The reason for this is that I wanted to be able to show loops in the chart.

I then made a few modifications so that any link connecting to the “Final” state that doesn’t go through the second-to-last one is coloured red, and those that do are coloured green. The red paths highlight the sad paths through the workflow: for example, if we can’t match a phone number to a patient then we transition straight to the final state, and that link is coloured red.

Real-world examples

The first example represents the appointment rescheduling workflow.

3-workflow-two

As you can see, it can be fairly complicated. Bear in mind that this shows the aggregation of all the different state changes; since there are multiple paths through the workflow, not every patient passes through every state.

We noticed that a fair few people were getting to the point where they were sent alternative slots but never selected one. We theorized that perhaps they needed the option to request more, so we added that functionality. As you can see from the blue backwards loop, not many people have actually used it.

Here is a further example from our partial booking workflow, where we ask patients to book their appointment when they reach the top of a waiting list. What’s interesting here is that a large number of users (patients) never actually start the process of booking.
2-workflow-one

Wrapping Up

We have tried various methods of visualising the paths our users take through our workflows. Before we were using Automatonymous, we did this with an Excel spreadsheet filled with data from a nasty SQL query, and a column chart to show where people exited. It worked for a while but wasn’t sustainable as a solution because of the manual process involved.

As our workflows have developed and we as an organisation have evolved, we have worked hard to improve how we do our reporting, with the main aims being:

  • Accessibility – decision makers and developers alike should have quick and easy access to information
  • Discoverability – new pieces of information should be exposed effortlessly
  • First class experience

As you can see from the charts above, Splunk has certainly delivered.

If you’re interested in seeing the code which makes these charts possible then leave a comment below.

Improving .Net Performance

Over the past couple of weeks we’ve noticed that our Reporting service is terribly slow.

splunk-dblogger

The above chart is part of our APM (Application Performance Monitoring) solution built on top of MiniProfiler and Splunk (see this blog post for more details). Our reporting service works by processing messages which are queued up by various other services in RabbitMQ. What the chart shows is that the reporting service works at a very constant and slow pace. It eventually gets through the backlog of messages, but it takes a few hours each morning while the CPU sits at around 100%.

Instead of the chart above, what I’d rather see is a big spike in activity that is sustained for a (much) shorter period of time!

In this post I’m going to show how I used the built-in tooling in Visual Studio 2013 to find and diagnose the bottlenecks in our code and then visualize the improvement in performance using Splunk and MiniProfiler.

Finding and fixing performance bottlenecks

I’ve known for a long time that Visual Studio has built-in performance sampling tools, but I have rarely used them. I found Sasha Goldshtein’s (blog|Twitter) Pluralsight course on Measuring .Net Performance to be an excellent overview of how to get started with them.

To measure the performance I first queued up a significant number of messages for the reporting service to process then started the Visual Studio Profiler.

Profiling CPU Usage

I started by setting the project I wanted to profile as the “Startup project”, then from the Analyze menu selected Performance and Diagnostics, which brings up the following window:

1-performance-and-diagnostics

I then selected Performance Wizard under Available Tools and clicked the Start button. This brings up a dialog with a few options: I selected CPU sampling as the profiling method and the Reporting Service project as the profiling target, then clicked Finish. The Reporting service started and Visual Studio began profiling; after about 5 minutes I stopped the profiling.

Viewing Profiler Reports and fixing bottlenecks

This is what the first profiling report looked like:

2-first-perf-report

As you can see from the Hot Path, the Infrastructure.PubSub.EzNQ.BusItemTypeNameSerializer.DeSerialize method gets called a lot and is potentially very slow! The good news is that this is our code, which means I can click on the row and Visual Studio will take me to the method, and even highlight the expensive code.

3-first-hot-path

The class in question is an implementation of an interface which is part of EasyNetQ (a very nice .Net API for RabbitMQ which all of our services use). The ITypeNameSerializer interface lets us customize our queue names and tells EasyNetQ what type it should deserialize a message into.
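
For context, the interface itself is tiny; in the EasyNetQ version we were using it looks roughly like this (check your version, as the exact shape may differ):

using System;

public interface ITypeNameSerializer
{
    // Produces the name EasyNetQ puts on the wire (and uses in queue names)
    string Serialize(Type type);

    // Resolves that name back to a .Net type when a message is consumed
    Type DeSerialize(string typeName);
}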

As shown above there is a clear bottleneck in this code. The fix turned out to be pretty straightforward: I created a typesCache dictionary at the class level and updated the DeSerialize method to:

// The cache is created once at the class level and reused for every message
private readonly Dictionary<string, Type> typesCache = new Dictionary<string, Type>();

public Type DeSerialize(string serializedName)
{
  // The type name is the part of the serialized name after the underscore
  var typeName = serializedName.Split('_')[1];

  if (typesCache.ContainsKey(typeName))
  {
    return typesCache[typeName];
  }

  // Only scan the loaded assemblies the first time we see a type name
  var type = AppDomain.CurrentDomain.GetAssemblies()
              .SelectMany(a => a.GetTypes())
              .FirstOrDefault(t => t.Name == typeName);

  typesCache[typeName] = type;

  return type;
}

After making these changes I ran the profiler again to see what difference they had made. Here are the profile results for the second run through:

4-second-hot-path

As you can see, the Infrastructure.PubSub.EzNQ.BusItemTypeNameSerializer.DeSerialize method doesn’t get a mention, so it looks like the fix was successful. But now a call to DataAccess.SimpleDataHelper.GetDatabase in the DbReporter class is in the hot path.

Clicking onto the ReportingService.DbReporter.WriteReportToDb row brings up the code for that method

5-writereporttodb

This C# code probably has any DBA wriggling in their seat right about now (sorry!). The method writes each message that it processes into a corresponding table in our database using C# dynamic and SimpleData. It’s pretty powerful and very neat, as we don’t have to write lots of code to persist all the different reporting events to the database.

Sadly there isn’t much that can be done about the .Insert call itself. However, after reading through the SimpleData documentation I realized that I don’t need to keep calling .GetDatabase for every message; I can do it once in the constructor and SimpleData will take care of opening and closing the connections as needed.
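
The change looks roughly like the sketch below; it assumes our SimpleDataHelper.GetDatabase wraps Simple.Data's Database.Open, and the real insert logic is trimmed right down.

public class DbReporter
{
    // Resolve the Simple.Data database object once; Simple.Data handles
    // opening and closing the underlying connection per operation.
    private readonly dynamic db;

    public DbReporter()
    {
        db = DataAccess.SimpleDataHelper.GetDatabase();
    }

    public void WriteReportToDb(dynamic reportItem)
    {
        // Illustrative: the real method picks the table based on the message type
        db.ReportEvents.Insert(reportItem);
    }
}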

After making these two fixes the profiler results after a third run don’t indicate anything specific that can be “fixed”, so it was time to push it up to our testing environment and do some real testing.

Comparing Performance

As I mentioned above, our custom APM solution is built on top of Splunk and MiniProfiler. There are a number of advantages that we get from this that we wouldn't get from the likes of New Relic or Stackify. One advantage is that we can add our own metadata to all the performance data we generate.

We put a version number into all the performance data we generate; it comes from the NuGet version number generated by TeamCity as part of our build process. Because every event carries this version number, comparing performance between versions becomes trivial.
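
The post doesn't show the wiring for that, but one plausible way to pick the number up at runtime, assuming the build stamps it into the assembly's informational version, is something like this:

using System.Reflection;

// Assumption: the build server writes the package version into
// AssemblyInformationalVersion; fall back to the plain assembly version otherwise.
public static class VersionInfo
{
    public static string Current()
    {
        var assembly = Assembly.GetExecutingAssembly();
        var info = assembly.GetCustomAttribute<AssemblyInformationalVersionAttribute>();
        return info != null ? info.InformationalVersion : assembly.GetName().Version.ToString();
    }
}

That value can then be attached to every event we emit, alongside the rest of the metadata.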

In Splunk I created the following dashboard, which allows me or any of my fellow developers at DrDoctor to select an Application and Transaction and then a baseline and comparison version.

comparing-performance

Judging by the results shown above, it’s working at roughly twice the speed. I’d hoped for it to be a bit faster than that, but still, I’m not complaining!

Unexpected consequences

Our Reporting service does two things: it writes all the IReportItem messages it processes to their respective tables in the database, and it writes all events into a log file for Splunk to index.

While doing some local testing I noticed that even writing to the log file was slow! The maximum speed I saw was ~50 messages/second. After making the fixes above, the processing speed went up to 5,000 messages/second! Not insignificant, that’s for sure.

6-queue speed

Send logs to Splunk with log4net UdpAppender

Last week I deployed a new API hosted as an Azure Web App. We wanted some reporting events to come out of it into our Splunk instance so we could keep an eye on whether it was working as expected. I started off by using the Splunk C# SDK as it looked nice and easy to add into our app.

A very trivial example of using the Splunk SDK would look like this:

public class SplunkLogger
{
	private readonly Splunk.Service service;
	private readonly Splunk.Receiver receiver;

	public SplunkLogger()
	{
		this.service = new Splunk.Service("myawesomesplunkinstance", 8089, "https");
		this.service.Login("", "");
		this.receiver = new Splunk.Receiver(service);
	}

	public void Log(string msg)
	{
		var args = new Splunk.ReceiverSubmitArgs
		{
			Source = "mysource",
			SourceType = "mysource-event",
			Index = "myindex"
		};

		receiver.Submit(args, msg);
	}
}

The Splunk SDK should have worked, but since there was A LOT of traffic at different times throughout the day the vast majority of the HTTP requests to the Splunk API just timed out.


System.Net.WebException: The operation has timed out
  at System.Net.HttpWebRequest.GetRequestStream(TransportContext& context)
  at System.Net.HttpWebRequest.GetRequestStream()
  at Splunk.HttpService.Send(String path, RequestMessage request)
  at Splunk.Service.Send(String path, RequestMessage request)
  at Splunk.Receiver.Submit(String indexName, Args args, String data)
  at Splunk.Receiver.Submit(Args args, String data)
  at Splunk.Receiver.Submit(ReceiverSubmitArgs args, String data)

After discovering that it constantly timed out, I started to look around for an alternative to the Splunk REST API endpoint.

I’ve always noticed that Splunk lets you send event data to it via TCP/UDP, but I’d never had cause to use it. Given that our API was hosted as an Azure Web App and logging to the file system wasn’t an option, this seemed like the perfect time to try it out. As it turns out, I was not disappointed.

Setting up Splunk

The first thing you will need to do is configure your Splunk instance to listen for events on a UDP port.

  1. From the Splunk settings, select Data Inputs (1-data-inputs)
  2. Click “Add new” next to the UDP input type (2-add-new-udp)
  3. This brings up the Add Data wizard; enter the port you want Splunk to listen on (this will also need to go into your log4net config) (3-add-data-wizard)
  4. The next page, “Input Settings”, lets you (optionally) specify various things about the data you are going to be sending. There are two particularly useful settings here: the sourcetype and the index (4-input-settings)
  5. Click Review and then Submit

Splunk is now listening on that UDP port.

Important Note: Make sure that all firewall ports between your producer and Splunk are open to allow UDP traffic on the specified port! Otherwise it will silently fail.

Introducing the log4net UdpAppender

The first step to getting this working from your application is adding the log4net NuGet package to your project. Next, open your log4net configuration and add the following <appender>:

<appender name="SplunkAppender" type="log4net.Appender.UdpAppender">
  <remoteAddress value="<your-splunk-server>" />
  <remotePort value="<your-udp-port>" />
  <layout type="log4net.Layout.PatternLayout">
    <ConversionPattern value="%m%n" />
  </layout>
</appender>
<root>
  <level value="DEBUG"></level>
  <appender-ref ref="SplunkAppender"></appender-ref>
</root>

Now from your C# code you use log4net as you normally would:

var logger = log4net.LogManager.GetLogger("SplunkLogger");
logger.Info("logged to splunk!");

And you should start seeing your events in Splunk.

If you need to verify that log4net is sending events over UDP you can use Wireshark. Here I have set it up to capture packets from my wifi adapter and to monitor port 9977:

5-wireshark

Log File Intelligence – log4net meets Splunk

At DrDoctor we are slowly adopting Splunk as our central reporting repository. We already have most of our application specific events going into it and we are already seeing some great benefits.

In this post I’m going to show the various steps I went through to get our log4net files ingested in a useful format. Monitoring a file is easy; extracting useful fields is sometimes a challenge, especially with log files.

Setting the format string

The first step was to change the format string in the log4net.config file. The main aim here was to make my life easier for when the log files go into Splunk: prefixing all the log4net tokens with a name means that I can write some simple, but very reliable, regexes in Splunk to turn these into fields.


%utcdate{ISO8601} Release:'!release!' Version:'!version!' Thread:'[%thread]' Level:'%-5level' Logger:'%logger' Message:'%message'%newlineException:'%exception'

Those who are familiar with the log4net configuration options will notice that there are two tokens that don’t exist in the format string above: !release! and !version!. These are two very useful values to capture alongside our error messages, as we can then start to track when new types of errors are discovered or introduced.

I’m using a custom PowerShell script in our deployment system, Octopus Deploy, to set these values during the deployment phase: the release number reflects the Octopus Deploy release and the version number reflects the build number from TeamCity. Arguably we probably don’t need both, but I’m not entirely sure what I’m going to need yet, so I’m going to stick with both for now.

Setting up Splunk

In our environment, and I would assume most others, we are using the Splunk Universal Forwarder to send data to Splunk. The first step then is to add a new entry to the inputs.conf file to keep an eye on our logs directory.

[monitor://d:\logs\*.log]
sourcetype=log4net

Well, that was easy. Now that the Universal Forwarder is tracking the log files directory, we should start seeing log entries appearing in Splunk.

This is a good start, but it would be more useful if we could start seeing the breakdown of the various entries. This is the point where we need to extract the various fields from the raw events.

To do this we need to make use of the Splunk field extractions. To extract more fields scroll down and click the link “Extract New Fields”

1-extract-new-fields

Then click “I prefer to enter the regex myself”

2-write-myown-regex

All the extractions follow a similar pattern, here is the regex to extract the log level:

Level:'(?P<Level>(([A-Za-z]*)))'

Enter that into the regex input, then click the Preview button. In the sample events you will see all the different logging levels highlighted, and you will also notice a new tab called Level appear.

3-preview

Here is the complete list:

Release:'(?P<Release>(([0-9|.]*)))'
Version:'(?P<Version>(([0-9|.]*)))'
Level:'(?P<Level>(([a-zA-Z]*)))'
Message:'(?P<Message>((.+)))'

Go through the steps above for each one.

Log Intelligence

Now we can start doing some fancy queries.

Example one: number of errors by Release and host

index=logs sourcetype=log4net | stats count by Release, host

4-count-by-release-host

Example two: number of errors over time

index=logs sourcetype=log4net | timechart count

5-timechart

Example three: number of errors by application

index=logs sourcetype=log4net | rex field=source "Error\\\(?<app>(\w|\s|\d)*)" | stats count by Release, host, app | sort -Release, -count 

6-errors-by-application

Next steps

There are many possibilities; here are a couple of ideas:

  • Build a dashboard from the various queries above
  • Create some Splunk alerts that trigger when an error threshold has been exceeded

Measuring Application Performance with Mini Profiler and Splunk


At DrDoctor we use New Relic for performance monitoring of our web applications. However, web only makes up a small part of our overall system. We make extensive use of a messaging-based system and have lots of micro-services which communicate over a messaging bus.

After adding New Relic to our Windows services we soon realized that they don’t really do a good job of supporting them. The main reason for this, as far as I can see from reading their docs, is that New Relic depends on an active HttpContext for profiling applications.

Having used New Relic to pinpoint the performance problems in our web app, I started to think about what it would take to build something similar but for a Windows Service. This blog post is showing off a proof-of-concept that I’m currently developing.

Introducing Mini Profiler

Mini Profiler is a lightweight C# library written by the folk at StackExchange. It was designed to show performance metrics for web applications, but it is flexible enough that I was able to plumb it into our Windows services. Doing this is pretty easy:

MiniProfiler.Settings.ProfilerProvider = new StackExchange.Profiling.SingletonProfilerProvider(); //note: not thread safe
MiniProfiler.Settings.Storage = new Log4NetStorage(container.Resolve<ILogger>()); //this is the castle-windsor wrapper interface
MiniProfiler.Settings.SqlFormatter = new SimpleSqlFormatter();

The next step was to create my own custom storage mechanism which would output the results of a profiling session to a log file. This was not entirely trivial, as the profiler session is a hierarchy of steps. Here is how I implemented the Save method of IStorage:

public void Save(MiniProfiler profiler)
{
    var timings = new Stack<Timing>();

    if (profiler.Root.HasChildren)
    {
        var children = profiler.Root.Children;
        for (var i = children.Count - 1; i >= 0; i--) timings.Push(children[i]);
    }
    else
    {
        timings.Push(profiler.Root);
    }

    var intro = string.Format(@"{{""Timestamp"":""{0}"",""Application"":""{1}"",""Status"":""Starting"",""TransactionId"":""{2}"",""Transaction"":""{3}""}}",
                    profiler.Started.ToString(DATE_FORMAT),
                    applicationName,
                    profiler.Root.Id,
                    profiler.Root.Name);

    logger.Info(intro);

    decimal runningDuration = 0;

    while (timings.Count > 0)
    {
        var timing = timings.Pop();
        var name = timing.Name;

        decimal customTimingDuration = 0;

        StringBuilder sb = new StringBuilder();

        if (timing.HasCustomTimings)
        {
            foreach (var item in timing.CustomTimings)
            {
                customTimingDuration = item.Value.Sum(v => v.DurationMilliseconds ?? 0);

                decimal customOffset = runningDuration;

                foreach (var item2 in item.Value)
                {
                    customOffset += item2.DurationMilliseconds ?? 0;
                    logger.Info(string.Format(@"{{""Timestamp"":""{0}"",""Application"":""{1}"",""Status"":""Processing"",""TransactionId"":""{2}"",""Transaction"":""{3}"",""Action"":""{4}"",""CommandString"":""{5}"",""Duration"":""{6}""}}",
                        profiler.Started.AddMilliseconds(Convert.ToDouble(customOffset)).ToString(DATE_FORMAT),
                        applicationName,
                        profiler.Root.Id,
                        profiler.Root.Name,
                        string.Format("{0}/{1}", name, item.Key),
                        item2.CommandString,
                        item2.DurationMilliseconds));
                }
            }
        }

        var offset = Convert.ToDouble(timing.DurationWithoutChildrenMilliseconds - customTimingDuration);
        var msg = string.Format(@"{{""Timestamp"":""{0}"",""Application"":""{1}"",""Status"":""Processing"",""TransactionId"":""{2}"",""Transaction"":""{3}"",""Action"":""{4}"",""Duration"":""{5}""}}",
            profiler.Started.AddMilliseconds(Convert.ToDouble(runningDuration)).ToString(DATE_FORMAT),
            applicationName,
            profiler.Root.Id,
            profiler.Root.Name,
            name,
            offset);

        logger.Info(msg);
        if (sb.Length > 0)
        {
            logger.Info(sb.ToString());
        }
        if (timing.HasChildren)
        {
            var children = timing.Children;
            for (var i = children.Count - 1; i >= 0; i--) timings.Push(children[i]);
        }

        runningDuration += timing.DurationWithoutChildrenMilliseconds + customTimingDuration;
    }

    var end = string.Format(@"{{""Timestamp"":""{0}"",""Application"":""{1}"",""Status"":""Finished"",""TransactionId"":""{2}"",""Transaction"":""{3}""}}",
                    profiler.Started.AddMilliseconds(Convert.ToDouble(runningDuration)).ToString(DATE_FORMAT),
                    applicationName,
                    profiler.Root.Id,
                    profiler.Root.Name);

    logger.Info(end);

}

This isn’t particularly pretty, and I’ll be the first to admit that using string.Format to create JSON is really bad practice, but I wanted something lightweight that would work with log4net, which is our current logging library. Really what I should do is adopt semantic logging and use something like Serilog, but I’ll save that for another day.
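
For what it's worth, the Serilog flavour of this would look something like the sketch below. It's not something in our codebase yet; RollingFile comes from the Serilog.Sinks.RollingFile package and the file path is made up.

using System;
using Serilog;

// A sketch of the semantic-logging alternative mentioned above, not what runs today
public static class SemanticProfilerLog
{
    public static void Configure(string applicationName)
    {
        Log.Logger = new LoggerConfiguration()
            .Enrich.WithProperty("Application", applicationName)
            .WriteTo.RollingFile(@"d:\logs\apm-{Date}.log") // illustrative path
            .CreateLogger();
    }

    public static void ProfilerEvent(string status, string transaction, Guid transactionId)
    {
        // Each named hole becomes a structured property, no hand-rolled JSON required
        Log.Information("Profiler {Status}: {Transaction} ({TransactionId})",
            status, transaction, transactionId);
    }
}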

The next step was to create my own ISqlFormatter, as the ones that come with Mini Profiler include a lot more detail than I wanted to record. This was very trivial:

public class SimpleSqlFormatter : ISqlFormatter
{
    public string FormatSql(string commandText, List<SqlTimingParameter> parameters)
    {
        return commandText;
    }
}

With these two classes written, plumbing it into my application was very easy and I soon had performance results being output as JSON into a log file, which I could then import into Splunk.

With the plumbing done, it’s time to actually instrument our code. With Mini Profiler you call MiniProfiler.Start() and then .Stop() at the beginning and end of a session. In addition to this, you can call .Step() anywhere in between, and this creates a kind of sub-profiler, if for example you want to explicitly measure the performance of a particular part of your code.

When using Mini Profiler you will want to break your code into individual sessions, e.g. on a website this would be a single web request, the idea being that you want to be able to measure and compare performance over time.

As mentioned above, the services that I want to add this to all communicate over a messaging bus. So the logical place to call .Start() and .Stop() is in the class that is invoked when a new message arrives on the bus. I also wrapped our database connection code with the Mini Profiler instrumentation code so that it can measure time spent executing SQL and what that SQL was.
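
A message handler wrapped this way ends up looking roughly like the sketch below. It assumes the MiniProfiler.Start overload that takes a session name; the real handlers are resolved by the bus and do considerably more.

using StackExchange.Profiling;

// Simplified sketch of a bus message handler instrumented with Mini Profiler
public class ProfiledMessageHandler
{
    public void Handle<TMessage>(TMessage message) where TMessage : class
    {
        // One profiling session per message, named after the message type
        var profiler = MiniProfiler.Start("Processor/" + typeof(TMessage).Name);
        try
        {
            using (profiler.Step("Dispatcher"))
            {
                // ... dispatch the message to the code that actually handles it ...
            }

            using (profiler.Step("EasyNetQ/Publish"))
            {
                // ... publish any follow-on messages ...
            }
        }
        finally
        {
            // Stop() hands the completed session to the configured IStorage
            MiniProfiler.Stop();
        }
    }
}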

Logger Output

After wiring everything up and running through a few test cases, this is a sample of the events that were produced:

{"Timestamp":"05/12/15 13:41:50.22 Z","Application":"ApplicationA","Status":"Starting","TransactionId":"2c9d0f76-cbca-4d4f-b736-48d09a3b75c1","Transaction":"Processor/MessageA"}
{"Timestamp":"05/12/15 13:41:50.24 Z","Application":"ApplicationA","Status":"Processing","TransactionId":"2c9d0f76-cbca-4d4f-b736-48d09a3b75c1","Transaction":"Processor/MessageA","Action":"Dispatcher/sql","CommandString":"select ######### from ##### WHERE (###### = @p1 AND ##### = @p2)","Duration":"18.9"}
{"Timestamp":"05/12/15 13:41:50.22 Z","Application":"ApplicationA","Status":"Processing","TransactionId":"2c9d0f76-cbca-4d4f-b736-48d09a3b75c1","Transaction":"Processor/MessageA","Action":"Dispatcher","Duration":"669.8"}
{"Timestamp":"05/12/15 13:41:50.93 Z","Application":"ApplicationA","Status":"Processing","TransactionId":"2c9d0f76-cbca-4d4f-b736-48d09a3b75c1","Transaction":"Processor/MessageA","Action":"EasyNetQ/Publish/MesageB","Duration":"34"}
{"Timestamp":"05/12/15 13:41:50.96 Z","Application":"ApplicationA","Status":"Processing","TransactionId":"2c9d0f76-cbca-4d4f-b736-48d09a3b75c1","Transaction":"Processor/MessageA","Action":"EasyNetQ/Publish/MessageC","Duration":"4.3"}
{"Timestamp":"05/12/15 13:41:50.96 Z","Application":"ApplicationA","Status":"Processing","TransactionId":"2c9d0f76-cbca-4d4f-b736-48d09a3b75c1","Transaction":"Processor/MessageA","Action":"EasyNetQ/Publish/MessageD","Duration":"5.2"}
{"Timestamp":"05/12/15 13:41:50.97 Z","Application":"ApplicationA","Status":"Finished","TransactionId":"2c9d0f76-cbca-4d4f-b736-48d09a3b75c1","Transaction":"Processor/MessageA"}

Building Splunk Dashboards

Since the data is being output as JSON it is really easy to import into Splunk, so I’m not going to cover that here. With the data in Splunk it’s time to turn the raw data into some useful charts.

The first one is a stacked area chart, which shows the total time spent for a transaction over time. High values on this chart aren’t necessarily indicative of performance problems, as it could just be that we were experiencing high throughput of a certain transaction.

1-transaction-overview

A variation on the chart above would be to show the average duration instead of the total time, which would be more indicative of performance problems.

The next chart shows the total time spent on processing for each different transaction in the selected time period. Once again, this chart isn’t necessarily indicative of a performance problem, but it does show where our system spends most of its time.

2-total-time

The next chart shows the average duration of a transaction; in the context of the previous chart this could start to highlight potential problem areas. A variation on this chart would be to show it stacked with the number of times the transaction was executed.

3-average-duration

The next chart shows the maximum time it took to execute a SQL statement. This is very high level but still useful for spotting SQL queries that might need to be tuned. Averages would probably be a better value to show here.

4-slowest-queries

The last chart is a stacked area chart of time spent on the different actions that were measured, over time. This is more useful when looked at in the context of a single transaction type.

5-transaction-details

As an added bonus, Splunk lets you click on any of the charts and drill through to the raw events.

4-event-drilldown

And here is what the dashboard looks like with everything in place. Note also the filters at the top of the page, so I can drill into the different applications and individual transactions – they also cross filter 🙂

6-dashboard

This is still very much a proof-of-concept; I’m hoping to get it running in production in the next couple of weeks. When it’s running in production I’ll post a follow-up.

Consuming JSON with Splunk in two simple steps

Last week I needed to configure Splunk to consume JSON log files. The documentation on the Splunk website wasn’t particularly clear, and my first attempt ended in some strange results with data being repeated… With the help of an old colleague of mine (thanks Matt), I was pointed in the direction of this Splunk Answers question, which described both the problem I was having and the solution – fixing the configuration.
So here are the steps required to setup Splunk to consume JSON from a log file. I’ll assume that you already have an instance of Splunk installed.

Step 1 – Install the Universal Forwarder (optional)

The setup that I was working with was a Splunk server running on a virtual machine in Azure and an on-premise server where the log files to consume were produced. Splunk provides a useful utility called the Universal Forwarder that consumes event data and sends it on to the Splunk server.
Installation is really straightforward so I’m not going to cover that here.

Step 2 – Configuring a custom source type

This is the part that caught me out. From the searching that I did the first time around, I learnt that I needed to set up a custom source type that tells Splunk to parse the data as JSON. The mistake that I made was creating this custom source type on the remote node where I had the Forwarder installed.

To do it correctly, you will need to open/create a props.conf file on the Splunk server with the following content:

[my_custom_type]
INDEXED_EXTRACTIONS = json
KV_MODE = none

The props.conf file can be found at $SPLUNK_HOME/etc/system/local/. If props.conf doesn’t exist in this folder (it didn’t for me) then you will need to create it.

Step 3 – Setting up log file monitoring

This is the easy part, and the part that I did do correctly. On the remote node, open the inputs.conf file and add the following:

[monitor://c:\logs\directory\]
sourcetype=my_custom_type

The inputs.conf file can be found at $SPLUNK_HOME/etc/system/local. With that done, data is going in and nothing is being duplicated.

No More Interruptions – Integrating Codealike and HipChat

I’ve recently started using Codealike, a service that tracks various metrics while I’m coding. The data it collects is then presented in a bunch of really useful ways to help determine when I’m being most productive, as well as the places in our code base where I spend most of my time, and various other things.

One of the metrics they calculate as part of this process is how “focused” you are, and from this they determine whether or not someone should interrupt you – there are three different levels: No Activity, Can Interrupt and Cannot Interrupt.

One of the worst things that can happen is being interrupted while you’re “in the zone” or as Codealike put it “on fire”.

1-may-i-interrupt

Codealike does provide a webpage that you could put on a display in your office to show your current status (you can view my current status), however whilst I have two monitors at work I tend to use both of them.

We use HipChat at work as our IM of choice, so during the last week I’ve started to set my status to Do Not Disturb when Codealike thinks I’m in the zone.

Introducing: Codealike IM Updater

To remove the interruption of having to update my status in HipChat when I’m in the zone, I set out to build the Codealike IM Updater. It’s a simple application that periodically checks my status according to Codealike and updates my status on HipChat accordingly.
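
Under the hood there isn't much to it: a timer, a status lookup and a mapping. Something like the structural sketch below, with the actual Codealike and HipChat API calls stubbed out because the endpoints themselves aren't the interesting part:

using System;
using System.Collections.Generic;
using System.Threading;

// Structural sketch only; the real Codealike/HipChat calls are omitted
public class ImUpdater
{
    // User-configurable mapping from Codealike levels to HipChat statuses
    private readonly Dictionary<string, string> levelToStatus = new Dictionary<string, string>
    {
        { "No Activity", "Available" },
        { "Can Interrupt", "Available" },
        { "Cannot Interrupt", "Do Not Disturb" }
    };

    private Timer timer;

    public void Start()
    {
        // Poll periodically and push the mapped status to HipChat
        timer = new Timer(_ => SetHipChatStatus(levelToStatus[GetCodealikeLevel()]),
            null, TimeSpan.Zero, TimeSpan.FromMinutes(1));
    }

    private string GetCodealikeLevel()
    {
        // Hypothetical: call the Codealike API for the user's current level
        return "Can Interrupt";
    }

    private void SetHipChatStatus(string status)
    {
        // Hypothetical: call the HipChat API to update the user's status/message
    }
}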

2-im-updater

As you can see, it is very simple – all the user needs to do is supply their Codealike username, their HipChat API token and email address, and how they want to map the different levels from Codealike to HipChat. Optionally, you can also specify a message to be displayed next to your name in HipChat.

The code is available on GitHub and if anyone would like to extend it to work with other IM services then I will happily accept pull requests.

Azure Stream Analytics – A Simple Proof-Of-Concept

In my last post I gave a high-level overview of the CallStats proof-of-concept app that I built using Azure Stream Analytics. In this post I’m going to explain how to get it running for yourself.

Prerequisites

You will need an Azure subscription with the Stream Analytics preview enabled.

Setting up Service Bus

The first step is to configure some Event Hubs; these will be the ingest and output for Stream Analytics.

  1. From the Azure portal (http://manage.windowsazure.com) create a new Service Bus namespace, (make a note of this, as you will need it when configuring the projects).
  2. Create a new Event Hub called CallStatsIn with 8 partitions
    1. Create a shared access policy with the Manage, Send and Listen permissions
    2. Make a note of the connection string for this policy
  3. Create another Event Hub, this time called CallStatsOut with 8 partitions
    1. Create a shared access policy with the Manage, Send and Listen permissions
    2. Make a note of the policy name and the primary key

Setting up Stream Analytics

The next step is to configure a new Stream Analytics job

  1. From the Stream Analytics section in the Azure portal create a new job, you can call it whatever you like
  2. Select the Input tab, and create a new event hub input called CallStatsIn specifying JSON as the serialization format
  3. Select the Output tab and create a new event hub output called CallStatsOut, specifying JSON as the serialization format
  4. Select the Query tab, and paste the following query:
SELECT
    ConnectionName, 
    DateAdd(second,-5,System.TimeStamp) as WinStartTime, 
    System.TimeStamp as WinEndTime, 
    Sum(CASE WHEN EventType=1 THEN 1 ELSE 0 END) as Started,
    Sum(CASE WHEN EventType=2 THEN 1 ELSE 0 END) as Ended,
    Sum(CASE WHEN EventType=3 THEN 1 ELSE 0 END) as Dropped,
    Sum(CASE WHEN EventType=4 THEN 1 ELSE 0 END) as NotAnsweredBusy
FROM 
    callstatsin
GROUP BY 
    HoppingWindow(second, 5, 1), ConnectionName

Setting up the Solution

Download my sample solution from GitHub (https://github.com/kzhen/StreamAnalytics/archive/master.zip).

  1. Open in Visual Studio (as an administrator)
  2. Open the App.config file in the Consumer project
    1. Set ServiceBus.Namespace to the namespace you created above
    2. Set ServiceBus.KeyName to the name of the shared access policy for CallStatsOut Event Hub
    3. Set ServiceBus.Key to the Primary Key for the CallStatsOut Event Hub
  3. Open the App.config file in the CallProducer project
    1. Set Microsoft.ServiceBus.ConnectionString to the connection string for the CallStatsIn Event Hub

Running the Solution

Build the solution, then start each project. You should see output in each of the console projects, and the graph on the webpage should start updating.

Azure Stream Analytics

There have been a number of new Azure services announced in recent weeks, one that caught my eye was Azure Stream Analytics. Stream Analytics is a cloud based event processing engine, which ingests real time events from various sources, runs a temporal query and then outputs the results for you to consume or to store for later analysis.

The feature set currently available is a simplified version of StreamInsight, the on-premise predecessor. I’ve written previously about using StreamInsight to visualize telephone calls in real-time. So I thought I would redo this using Stream Analytics.

The pieces of the puzzle

There are a number of different components to the overall solution

stream-analytics-telephony

What they do:

  • Fake Telephony App Server – this publishes events to the Azure Event Hub (CallStatsIn)
  • Event Consumer – this subscribes to the events which are published by Stream Analytics and broadcasts the events to a SignalR hub
  • Web Dashboard – hosts the SignalR hub and a web page which listens to the SignalR hub and updates a real-time chart as data comes in

Events and Queries

The events which the Fake Telephony App Server produces aren’t very complicated:

public class PhoneCallEvent
{
  public string ConnectionName { get; set; }
  public Guid Id { get; set; }
  public int EventType { get; set; }
}

On a random schedule it generates a number of CallStarted, CallEnded, CallDropped and CallNotAnsweredBusy events and publishes them to the Event Hub (CallStatsIn).
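
The publishing side is a thin wrapper over the Service Bus client; roughly the sketch below, with connection string handling and the random scheduling omitted:

using System.Text;
using Microsoft.ServiceBus.Messaging;
using Newtonsoft.Json;

// Roughly how the fake telephony server pushes events into CallStatsIn
public class CallStatsPublisher
{
    private readonly EventHubClient client;

    public CallStatsPublisher(string serviceBusConnectionString)
    {
        client = EventHubClient.CreateFromConnectionString(serviceBusConnectionString, "CallStatsIn");
    }

    public void Publish(PhoneCallEvent phoneCallEvent)
    {
        // Events are serialized as JSON, matching the format chosen on the Stream Analytics input
        var json = JsonConvert.SerializeObject(phoneCallEvent);
        client.Send(new EventData(Encoding.UTF8.GetBytes(json)));
    }
}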

In Stream Analytics I have a job which is subscribed to the CallStatsIn Event Hub and aggregates this data with the following temporal query:

SELECT
    ConnectionName,
    DateAdd(second,-5,System.TimeStamp) as WinStartTime,
    System.TimeStamp as WinEndTime,
    Sum(CASE WHEN EventType=1 THEN 1 ELSE 0 END) as Started,
    Sum(CASE WHEN EventType=2 THEN 1 ELSE 0 END) as Ended,
    Sum(CASE WHEN EventType=3 THEN 1 ELSE 0 END) as Dropped,
    Sum(CASE WHEN EventType=4 THEN 1 ELSE 0 END) as NotAnsweredBusy
FROM
    callstatsin
GROUP BY
    TumblingWindow(second, 5), ConnectionName

Web Dashboard

The above query streams its results to the CallStatsOut Event Hub, which my consumer is subscribed to. The consumer then broadcasts the results as they are received to a SignalR hub. My web dashboard is subscribed to the SignalR hub and updates my line chart in real time.
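
On the consumer side the broadcast is a one-liner once you have a hub; something like this (hub and client method names are illustrative):

using Microsoft.AspNet.SignalR;

// Illustrative hub; the dashboard page subscribes to the "updateCallStats" client method
public class CallStatsHub : Hub
{
}

public class CallStatsBroadcaster
{
    // Called by the Event Hub consumer for each result read from CallStatsOut
    public void Broadcast(string callStatsJson)
    {
        var context = GlobalHost.ConnectionManager.GetHubContext<CallStatsHub>();
        context.Clients.All.updateCallStats(callStatsJson);
    }
}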

2-live-chart

This is the output from the Fake Telephony Server App, showing the total number of events being published to the CallStatsIn Event Hub.

4-producer

This is the raw data from the CallStatsOut Event Hub, which is then broadcast to the SignalR hub which the dashboard is subscribed to

3-consumer

Wrapping Up

My first impressions of working with Stream Analytics are overall very positive. It is an easy service to get going with and it has lots of potential.

So far the only downside that I’ve come across is that you’re only allowed one output per job. This means that if you want to store the results for later processing you would need to set up two duplicate jobs with the same query but with different outputs (a PITA to maintain).

I’ll put my code up on GitHub in the coming days for anyone who is interested.