Visualizing Workflows with Splunk and D3

At DrDoctor we do a lot with SMS, we let patients reschedule appointments, book appointments when they move to the top of a waiting list and we provide patients with the ability to move their appointment forward by sending them lists of slots as capacity becomes available.

In all of this we use workflows to represent what is going on between the patient’s mobile phone and our system. We call them workflows but really they are just state machines built on top of the excellent .Net library Automatonymous.

Like all startups we want to ensure that we are always delivering an excellent experience to all our users, which means keeping an eye out for those who get stuck and changing the workflows accordingly.

Splunking State Changes

We’ve been using Splunk as part of our reporting system for about 4 months now. We put lots of different events into it, including a huge variety of events which are generated by our workflows. Whenever an instance of a workflow changes state we publish an event similar to this:

"EventId": "ee9b4347-fa3c-4b9f-bcc8-396550dd389c",
"Category": "WorkflowStateChanged",
"Payload": {
"CurrentState": "WaitingForBookingTeam",
"InstanceId": "7f205b67-bc9b-4b63-a176-b1d167ca7e29",
"InstanceType": "PartialBookingInstance",
"PreviousState": "WaitingForSlotSelection",
"Timestamp": "2015-07-25T11:16:58.0722714Z"

Listing all the different state transitions can be done with the following Splunk search:

Category=WorkflowStateChanged | stats count by Payload.PreviousState Payload.CurrentState

Which produces the following table of events


Splunk and D3

That table of data isn’t really very useful when looking at it by eye, it requires a fair bit of work to trace the path through a given workflow. That’s where D3 comes into the picture – D3 is a JavaScript library used to produce interactive charts. It’s incredibly powerful, but with that power come with a bit of learning curve.

D3 charts aren’t built directly into Splunk, however if you’re happy to learn a bit of JavaScript and a small bit of Python’s Django framework then you can do some pretty cool stuff with the Splunk Web Framework.

To produce the D3 charts shown below I followed the instructions from the Web Framework getting started tutorial, but then used a modified version of the Sankey chart based on this example by Colin Fergus. The reason for this is that I wanted to be able to show loops in the chart.

I then made a few modifications so that any link connecting to the “Final” state that doesn’t go through the second last one is colour red, and those that do coloured green. The red paths highlight to us the sad path through the workflow. So for example if we can’t match a phone number to a patient them we would transition to the final state and colour this red.

Real-world examples

The first example represents the appointment rescheduling workflow.


As you can see it can be fairly complicated. Don’t forget that this is showing the aggregation of all the different state changes, given that there are multiple paths through the workflow not all people will go through all the states.

We notices that a fair few people were getting to the point where they were set alternative slots but never selecting one, we theorized that perhaps they needed the option to request more, so we added that functionality in. As you can see from the blue backwards loop not many people have actually used this.

Here is a further example from our partial booking workflow, where we ask patients to book their appointment when the reach the top of a waiting list. What’s interesting about this is that a large number of users (patients) never actually even start the process of booking.

Wrapping Up

We have tried out various methods of visualising the paths through our workflows that our users take. We used to do this with an Excel spreadsheet which we filled up with data from a nasty SQL query and then used a column chart to show where people exited, this was before we were using Automatonymous. It worked for a while but wasn’t really sustainable as a solution due to the manual process.

As our workflows have developed and we as an organisation have evolved we have worked hard to improve how we do our reporting. With the main aims being:

  • Accessibility – decision makers and developers alike should have quick and easy access to information
  • Discoverability – new pieces of information should be exposed effortlessly
  • First class experience

As you can see from the charts above, Splunk has certainly delivered.

If you’re interested in seeing the code which makes these charts possible then leave a comment below.

JavaScript testing – first impressions of Wallabyjs and Visual Studio

I’ve been a big fan of NCrunch for a long time now, when it comes to .Net test runners I don’t think there are any that beat it. So when I heard about the JavaScript test runner Wallaby I was intrigued enough to give it a try, even if it did mean writing JavaScript tests.

This week I started to write some new AngularJS functionality at DrDoctor and thought that now was as good a time as any to start writing a few JavaScript tests.

This post isn’t about writing Jasmine tests for Angular, there are already plenty of tutorials out on the internet, but rather about highlighting some of the features of Wallabyjs.

In line code coverage
When you have failing tests the inline coverage changes for that block to red
Show covering tests
If a test is failing it will show up in here as well and give you some details
Wallabyjs will also display the reasons your tests are failing
Highlight uncovered lines of code in a file

Improving .Net Performance

Over the past couple of weeks we’ve noticed that our Reporting service is terribly slow.


The above chart is part of our APM (Application Performance Monitoring) solution built on top of MiniProfiler and Splunk (see this blog post for more details). Our reporting service works by processing messages which are queued up by various other services onto a queue in RabbitMQ. What this chart is showing is that the reporting service is working at a very constant and slow pace. It will eventually get through the backlog of messages, but it takes it a few hours each morning while the CPU sits at around 100%.

Instead of the chart above, what I’d rather see is a big spike in activity that is sustained for a (much) shorter period of time!

In this post I’m going to show how I used the built-in tooling in Visual Studio 2013 to find and diagnose the bottlenecks in our code and then visualize the improvement in performance using Splunk and MiniProfiler.

Finding and fixing performance bottlenecks

I’ve known for a long time that Visual studio has built-in performance sampling tools, but have rarely used them. I found Sasha Goldshtein’s (blog|Twitter) Pluralsight course on Measuring .Net Performance an excellent overview of how to get started with them.

To measure the performance I first queued up a significant number of messages for the reporting service to process then started the Visual Studio Profiler.

Profiling CPU Usage

I started by setting the project I wanted to profile as the “Startup project”. Then from the Analyze menu, selected Performance and Diagnostics, which brings up the following window


I then selected Performance Wizard under Available Tools and then clicked the Start button. This brings up a dialog where you can select a few options, I selected CPU sampling as the profiling method, then the Reporting Service project as the profiling target, then clicked finished. The Reporting service then started and Visual Studio started profiling, after about 5 minutes I stopped the profiling.

Viewing Profiler Reports and fixing bottlenecks

This is what the first profiling report looked like


As you can see from the Hot Path, the Infrastructure.PubSub.EzNQ.BusItemTypeNameSerializer.DeSerialize method gets called lot and is potentially very slow! The good news here is that this is our code, which means I can click on this row and visual studio will take me to this method, and even highlight the expensive code.


The class in question is an implementation of an interface which is part of EasyNetQ (a very nice .Net API for RabbitMQ which all of our services use). The interface ITypeNameSerializer let’s us customize our queue names and tells EasyNetQ what type it should deserialize a message into.

As shown above there is a clear bottleneck in this code, the fix turned out to be pretty straightforward, I created the typesCache Dictionary at the class level, and then updated the DeSerialize method to:

public Type DeSerialize(string serializedName)
  var typeName = serializedName.Split('_')[1];

  if (typesCache.ContainsKey(typeName))
    return typesCache[typeName];
    var type = AppDomain.CurrentDomain.GetAssemblies()
                .SelectMany(a => a.GetTypes())
                .FirstOrDefault(t => t.Name == typeName);

    typesCache[typeName] = type;

    return type;

After making these changes I ran the profiler again to see what difference that had made, here are the profile results for the second run through


As you can see the Infrastructure.PubSub.EzNQ.BusItemTypeNameSerializer.DeSerialize method doesn’t get a mention, looks like the fix was successful. But now a call to DataAccess.SimpleDataHelper.GetDatabase in the DbReporter class is in the hot path.

Clicking onto the ReportingService.DbReporter.WriteReportToDb row brings up the code for that method

5-writereporttodbThis c# code probably has any DBA wriggling in their seat right about now (sorry!), this method is writing each message that it processes into a corresponding table in our database using the likes of c# dynamic and SimpleData. It’s pretty powerful and very neat as we don’t have to write lots of code to persist all the different reporting events to the database.

Sadly there isn’t much that can really be done about the .Insert call, however after reading through the SimpleData documentation I realized that I don’t need to continually make calls to .GetDatabase, but rather can do this once in the constructor and SimpleData will take care of opening and closing the connections as needed.

After making these two fixes the profiler results after a third run don’t indicate anything specific that can be “fixed”, so it was time to push it up to our testing environment and do some real testing.

Comparing Performance

As I mentioned above our custom APM solution is built on top of Splunk and MiniProfiler. There are a number of advantages that we get from this that we wouldn’t get from the likes of New Relic or Stackify. One advantage is that we can add our own meta data to all the performance data we generate.

We put a Version number into all the performance data that we generate, this comes from the NuGet version number generated by TeamCity as part of our build process. Because we put this Version number in it means comparing performance between versions becomes trivial.

In Splunk I created the following dashboard, which allows me or any of my fellow developers at DrDoctor to select an Application and Transaction and then a baseline and comparison version.


Judging by the results shown above, it’s working at roughly twice the speed, I’d have hoped for it to be a bit faster than that, but still I’m not complaining!

Unexpected consequences

Our Reporting service does two things it writes all IReportItem messages which it processes to their respective table in the database and the second thing is write all events into a log file for Splunk to index.

While doing some local testing I noticed that even writing to a log file was slow! The maximum speed I saw was ~50messages/second just to write to a log file! After making the fixes above the processing speed went up to 5,000 messages/second! Not insignificant that’s for sure.

6-queue speed