More Reasons to Adopt Biml

SSIS packages aren’t backwards compatible – that means if you start building packages targeting SQL Server 2012 and then find out you also need to support an earlier version (like what happened to me this week), you will need to start over. However, if you used Biml to generate your packages you could spin up an instance of SQL Server 2008 in Azure, load your Biml project in BIDS and regenerate your packages targeting the earlier version.

You can change quickly – I recently found myself in a situation where I had developed Extract, Transform and Load packages, and when it came to testing them I realized they were wrong. I needed to load data as soon as it had been transformed, because downstream transforms depended on that data already being loaded. Here’s a picture:

1-Before

What I had to do was combine the Transform and Load packages into a single package. I could have gone through and copied, pasted, joined up and fixed any problems for each and every one of them (there were 30+ transforms), but that would have taken a while and been a very boring task. Instead I made a very minor change to my Biml and then regenerated the package.

2-After

You can refactor quickly – this is in a similar vein to the point above. In another project I did my development against SQL Server 2012; when it came time to deploy I found out the client site I had to deploy to was still running SQL Server 2008. That meant I had to say goodbye to the Project Deployment Model, along with the nice and easy configuration experience you get with 2012, and go back to using XML configuration files – this was a bit of a pain to figure out. Given that I had about 40 packages, the effort of going through each one to enable Package Configurations would have been immense (and boring); instead, after I figured out how to do it in Biml, all I needed to do was regenerate the packages.

Drag and drop doesn’t scale – it works fine for small ad-hoc packages that you end up throwing away; however, I strongly hold the view that if you’re doing any serious development, building packages by hand just isn’t going to scale. This is especially true when all you’re really doing is implementing design patterns. I don’t remember the last time I dragged and dropped something in SSDT BI, and I have been doing a lot of SSIS development recently!

In conclusion – and this is the main reason you should adopt Biml – your time scales better. Everything I have outlined above I could have done by hand, but it would have taken a while and would have been boring. I probably would have made mistakes along the way too. Instead, by using Biml to create the packages I can concentrate on more important things (like writing this blog post, automating our database deployment or automating our release process).


BIML – Passing Variables to Child Packages

A project I’ve been working on recently has consisted of lots of little packages which are called from a single coordinating package. I started off developing against SQL Server 2012, and thankfully used BIML to generate all my packages. The main controller package passes a variable down to the sub-packages, which is subsequently used when executing a stored proc. This all worked nicely until I was told that it needed to work against SQL Server 2008 😦 Thanks to the ease with which BIML allows changes to be made, I had it fixed and working again within minutes.

This post will show how to pass variables from a parent package to a child package for both SQL Server 2012 and SQL Server 2008.

SQL Server 2012

This is actually really easy for SQL Server 2012, thanks to the introduction of package parameters. The below BIML will generate two SSIS packages.

Parent – This package has a variable defined as ParentVar; the Execute Package task will pass this variable as the value for the ChildParam parameter defined in the Child package.

Child – This package has a single parameter defined, ChildParam, which will have its value filled in by the Parent package.

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Packages>
    <Package ConstraintMode="Linear" Name="Parent">
      <Variables>
        <Variable DataType="String" Name="ParentVar">FromParentPackage</Variable>
      </Variables>
      <Tasks>
        <ExecutePackage Name="Execute Child">
          <ParameterBindings>
            <ParameterBinding VariableName="User.ParentVar" Name="ChildParam"></ParameterBinding>
          </ParameterBindings>
          <ExternalProjectPackage Package="Child.dtsx"></ExternalProjectPackage>
        </ExecutePackage>
      </Tasks>
    </Package>
    <Package ConstraintMode="Linear" Name="Child">
      <Parameters>
        <Parameter DataType="String" Name="ChildParam"></Parameter>
      </Parameters>
      <Tasks>
        <Container ConstraintMode="Linear" Name="DoStuff"></Container>
      </Tasks>
    </Package>
  </Packages>
</Biml>

SQL Server 2008

In SQL Server 2008 we don’t have the luxury of package parameters so we have to fall back to Package Configurations. To pass a variable to a child package we need to define a “Parent package variable” configuration, which tells the Child package to expect a given variable from the Parent package and which local variable it should be mapped to.

Once again the following BIML will generate two packages.

Parent – This package has the variable defined and set, as well as an Execute Package task which will execute the Child package.

Child – This package has a Package Configuration which will map the ParentVar variable to the local ChildVar variable. You can see the Package Configurations by right-clicking anywhere in the package control flow and then clicking Package Configurations; this will bring up the following:

Package-configurations

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Packages>
    <Package ConstraintMode="Linear" Name="Parent">
      <Variables>
        <Variable DataType="String" Name="ParentVar">FromParentPackage</Variable>
      </Variables>
      <Tasks>
        <ExecutePackage Name="Execute Child">
          <Package PackageName="Child"></Package>
        </ExecutePackage>
      </Tasks>
    </Package>
    <Package ConstraintMode="Linear" Name="Child">
      <Variables>
        <Variable DataType="String" Name="ChildVar" InheritFromPackageParentConfigurationString="ParentVar"></Variable>
      </Variables>
      <Tasks>
        <Container ConstraintMode="Linear" Name="DoStuff"></Container>
      </Tasks>
    </Package>
  </Packages>
</Biml>

(To use this sample, make sure that the project you generate the packages in is set to the “Package Deployment Model”)

Biml From the Trenches – Convention over Configuration

Biml, from what I have observed, is still in its infancy. Don’t get me wrong, what can actually be done with Biml is impressive; however, it seems to me that how we are using it and how we are talking about it still need to mature before it can be fully and effectively embraced by the community at large.

There is a growing number of people in the SQL Server community contributing excellent guidance and driving forward its maturity, as well as BimlScript.com, which has an ever-growing set of examples to learn from. I firmly believe that the future of Biml is bright – if you don’t believe me, check out this interview with its creator, Scott Currie.

What I would like to see more of is people talking about implementing design patterns in Biml; let’s have fewer examples and more theory.

One area where there is still little guidance is convention-based Biml.

Talk to any experienced software developer and they will tell you all about "Convention over Configuration" and how it makes their life easier, but if you’re a BI developer this might be a brand new concept to you.

Wikipedia has a great description:

Convention over configuration is a software design paradigm which seeks to decrease the number of decisions that developers need to make, gaining simplicity, but not necessarily losing flexibility.

And

The phrase essentially means a developer only needs to specify unconventional aspects of the application.

In my mind there are a few different types of conventions:

  1. Those that require strict adherence
  2. Those that can be overridden
  3. Those that are on or off by default

In this post I’m going to give two examples of the first type, those I class as “strict adherence required” from a recent ETL project I was working on. I’ll cover the other types in subsequent posts.

NB: I’m deliberately not including any Biml or screenshots of SSIS packages in this post. The reason for this is that there are numerous ways to implement design patterns, so rather than showing you how I do it, I want to introduce the concepts and get you thinking about it.

Use a convention to Extract data

What I want to achieve: I want to move data from a source system into my target database, but only a subset of tables in the source system.

My convention: For each table in the Import schema of my Target database, there is a table with the same name and structure in the source system. Then, for each table in the Import schema, generate a Data Flow task which reads from the source table and copies the data into the target table.

The advantage: When I need to start importing more data from the source system, all I need to do is create a new table in my Import schema that matches the source and then regenerate the extract package. No configuration is needed and I don’t need to maintain a list of tables that need to be imported; the presence of the table in the Import schema is enough. Simple.

Use a convention to Transform data

The ETL project that I was working on already had a number of queries written which would transform the source system data to match our table structure. We had a number of tables in the Stage schema into which the transformed data needed to go before being loaded into the live tables.

What I want to achieve: Transform data from the Import schema tables and load it into tables in the Stage schema.

My convention: For each table in the Stage schema, there is a corresponding stored procedure with a name like: [Stage table name]_Transform

The advantage: Once again, when I need to start transforming more data, only two things need to be done: create the new table in the Stage schema and create the transform stored procedure. Then I just regenerate my SSIS package – no other changes required.

What’s the point?

Hopefully you’ve figured it out, but the point of "convention over configuration", or "coding to a convention", is to reduce the number of decisions that you need to make during the development effort. This allows you to spend more time on the hard stuff and let the computer take care of the rest.

BIML Tip – Use LinqPad to write C#

To see any real benefit from BIML you will need to combine it with C# to generate dynamic packages. The only problem is, unless you fork out $1000s for MIST, your only alternative is BIDS Helper. This is a great tool, but one of its biggest drawbacks is that you lose IntelliSense for C#. Lately I’ve been using LinqPad to prototype and test C# code which I then paste into my BIML files.

In this tip I’m going to show you how you can use LinqPad to speed up your BIML Scripting experience.

Just a quick note before we get started: to get the added benefit of IntelliSense in LinqPad you have to fork out USD $39 for the premium edition; however, as you will see, this is definitely a worthwhile expense. Even if you don’t purchase the premium version you can still follow along, you’ll just miss out on the auto-complete. (I have not been paid for this endorsement.)

Setting Up LinqPad

You can download LinqPad from http://www.linqpad.net/

Once you have installed and launched LinqPad you will see an empty query window.

1-main-query-window

To be able to use the BIML engine and the Varigence extensions we’ll need to add references to the required assemblies and import their namespaces. To do that, press F4 to bring up the Query Properties window.

The first thing that you will need to do is add the assembly references. Click the Browse button and locate BimlEngine.dll (for me this is in C:\Program Files (x86)\BIDS Helper 2012). I also added a reference to WindowsBase.dll which you can do by clicking the Add button then searching for that assembly.

2-additional-references

The next thing to do is set up the namespaces: click the "Additional Namespace Imports" tab and paste the following:


System.Data
System.Data.SqlClient
Varigence.Hadron.CoreLowerer.SchemaManagement
Varigence.Hadron.Extensions.SchemaManagement
Varigence.Languages.Biml
Varigence.Languages.Biml.Connection
Varigence.Languages.Biml.Table
Varigence.Hadron.Extensions

3-additional-namespace-imports

Click OK to close the Query Properties window.

Language Modes in LinqPad

LinqPad gives you a few different options for writing C#.

4-language-modes

 

I’ve found that the most useful modes for writing C# to be used with BIML are either the “C# Program” or “C# Statements” mode.

The “C# Program” mode is useful when you want to write separate methods which can be reused in your BIML, whereas “C# Statements” mode is useful when you just want to write some C# to paste directly into your BIML – you can’t break this into different methods.
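To make the difference concrete, here is a minimal “C# Program” mode query (the helper method and the table names are just placeholders I made up, not part of the Varigence API):

// "C# Program" mode: LinqPad gives you a Main entry point plus any helper
// methods you declare below it, which you can later move into <#+ ... #>
// blocks in your BIML.
void Main()
{
	GetTableNames().Dump();   // Dump() is LinqPad's built-in output method
}

// Placeholder helper - in a real query this would call the Varigence
// ImportDB extension methods shown later in this post.
IEnumerable<string> GetTableNames()
{
	return new[] { "DimCustomer", "DimProduct" };
}

In “C# Statements” mode you would instead write the two lines inside Main as top-level statements, with no method declarations.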

Writing C# – Example 1

In this first example I’ll show you how you could use LinqPad to write some very simple C# code that will use the Varigence extension methods to query the metadata of your database.

This is useful in BIML when you want to generate either a number of different packages, one per table, or multiple sequences within a single package, one per table.

Set LinqPad to use “C# Statements” mode and paste in the following code:

string connectionstring = "Data Source=.;Initial Catalog=AdventureWorksDW2012;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;";

var targetConnection = SchemaManager.CreateConnectionNode("SchemaProvider", connectionstring); 

IEnumerable<AstTableNode> sourceTables = targetConnection.ImportDB("dbo", "Dim%", ImportOptions.ExcludeViews).TableNodes; 

sourceTables.ToList().ForEach(x=>x.Name.Dump());

This will simply use the SchemaManager and ImportDB extension methods provided by Varigence to dump the names of all the tables in the AdventureWorksDW2012 dbo schema that start with Dim.

5-example1-1

 

 

The next thing to do is make this a bit more useful; paste in the following code:


string connectionstring = "Data Source=.;Initial Catalog=AdventureWorksDW2012;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;";

var targetConnection = SchemaManager.CreateConnectionNode("SchemaProvider", connectionstring);

IEnumerable<AstTableNode> sourceTables = targetConnection.ImportDB("dbo", "Dim%", ImportOptions.ExcludeViews).TableNodes;

foreach (var table in sourceTables)
{
	table.Name.Dump();
	table.GetColumnList().Dump();
}

This will now loop over all the tables and this time dump the table name and then the column list.

6-example1-2

 

These two snippets demonstrate how to get started using LinqPad.

We can now take the second snippet and plug it into a BIML file:


<!--
<#@ import namespace="System.Data" #>
<#@ import namespace="Varigence.Hadron.CoreLowerer.SchemaManagement" #>
<#@ import namespace="Varigence.Hadron.Extensions.SchemaManagement" #>

<#
	string connectionstring = "Data Source=.;Initial Catalog=AdventureWorksDW2012;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;";

	var targetConnection = SchemaManager.CreateConnectionNode("SchemaProvider", connectionstring); 

	IEnumerable<AstTableNode> sourceTables = targetConnection.ImportDB("dbo", "Dim%", ImportOptions.ExcludeViews).TableNodes;
#>
-->

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
	<Connections>
		<OleDbConnection ConnectionString="Data Source=.;Initial Catalog=AdventureWorksDW2012;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" Name="Source"></OleDbConnection>
		<OleDbConnection ConnectionString="Data Source=.;Initial Catalog=MyDW;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" Name="Target"></OleDbConnection>
	</Connections>
	<Packages>
		<Package ConstraintMode="Linear" Name="Extract">
			<Connections>
				<Connection ConnectionName="Source"></Connection>
				<Connection ConnectionName="Target"></Connection>
			</Connections>
			<Tasks>
			<!--
			<# foreach(var table in sourceTables)
			{
			#>
			-->
			<ExecuteSQL ConnectionName="Target" Name="Truncate Import Table <#=table.Name#>">
				<DirectInput>
					TRUNCATE TABLE Import.<#=table.Name#>
				</DirectInput>
			</ExecuteSQL>
				<Dataflow Name="Copy data into Source <#=table.Name#>">
					<Transformations>
						<OleDbSource ConnectionName="Source" Name="Source - <#=table.Name#>">
							<DirectInput>
								SELECT
									<#=table.GetColumnList()#>
								FROM
									<#=table.Name#>
							</DirectInput>
						</OleDbSource>
						<OleDbDestination ConnectionName="Target" Name="Target - <#=table.Name#>">
							<ExternalTableOutput Table="Import.<#=table.Name#>"></ExternalTableOutput>
						</OleDbDestination>
					</Transformations>
				</Dataflow>
			<!--
			<#
			}
			#>
			-->
			</Tasks>
		</Package>
	</Packages>
</Biml>

This BIML script will then generate a package which will use the AdventureWorksDW2012 database as a source database and copy all the data from each of the Dim tables into an equivalent table in the Import schema of the MyDW database.

Writing C# – Example 2

In this example I will show you how you can use LinqPad to write reusable methods which can then be used in your BIML files.

The example that I will use is the GetSqlServerExtendedProperties method, which was provided by Dave Stein (b | t); it simply queries the extended properties of a given table.

Set LinqPad to use “C# Program” mode and paste in the following code:


AstOleDbConnectionNode targetConnection;

void Main()
{
	targetConnection = SchemaManager.CreateConnectionNode("SchemaProvider", "Data Source=.;Initial Catalog=AdventureWorksDW2012;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;"); 

	var sourceTables = targetConnection.ImportDB("dbo", "Dim%", ImportOptions.ExcludeViews).TableNodes; 

	foreach (var element in sourceTables)
	{
		GetSqlServerExtendedProperties(element);
	}

	foreach (var element in sourceTables)
	{
		var skip = element.Annotations.FirstOrDefault (a => a.Tag.Equals("ETL_Should_Skip"));

		if (skip != null && skip.Text.Equals("1"))
		{
			//skip the table as it is marked to skip
			continue;
		}

		// ...otherwise generate whatever is needed for this table
	}
}

AstNode GetSqlServerExtendedProperties(AstTableNode table)
{
	AstNode returnValue = (AstNode)table;

	var query = string.Format("SELECT name, value FROM fn_listextendedproperty(NULL, 'schema', '{0}', 'table', '{1}', NULL, NULL)", table.SchemaName, table.Name);

	var extendedProperties = ExternalDataAccess.GetDataTable(targetConnection.ConnectionString, query);

	foreach (DataRow ep in extendedProperties.Rows)
	{
		returnValue.AddAnnotation(AnnotationType.Tag, ep["value"].ToString(), ep["name"].ToString());
	}

	return returnValue;
}

This snippet will query the extended properties of all the tables in the dbo schema that start with Dim and then loop through each of them. If it finds an annotation called ETL_Should_Skip with a value of 1, it will skip that table and move on to the next one in the sequence.

This could be useful, for example, if you wanted to control which tables should go into your package without hard-coding the list or having to add a "control table" to your database.
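To mark a table in this way you would set the extended property directly in SQL Server, something along these lines (DimCurrency is only an illustrative table name):

EXEC sys.sp_addextendedproperty
    @name = N'ETL_Should_Skip',
    @value = N'1',
    @level0type = N'SCHEMA', @level0name = N'dbo',
    @level1type = N'TABLE',  @level1name = N'DimCurrency';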

The BIML would look like this:

<!--
<#@ import namespace="System.Data" #>
<#@ import namespace="Varigence.Hadron.CoreLowerer.SchemaManagement" #>
<#@ import namespace="Varigence.Hadron.Extensions.SchemaManagement" #>

<#+
AstOleDbConnectionNode conn;
#>

<#
	string connectionstring = "Data Source=.;Initial Catalog=AdventureWorksDW2012;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;";

	conn = SchemaManager.CreateConnectionNode("SchemaProvider", connectionstring); 

	IEnumerable<AstTableNode> sourceTables = conn.ImportDB("dbo", "Dim%", ImportOptions.ExcludeViews).TableNodes;
#>
-->

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
	<Connections>
		<OleDbConnection ConnectionString="Data Source=.;Initial Catalog=AdventureWorksDW2012;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" Name="Source"></OleDbConnection>
		<OleDbConnection ConnectionString="Data Source=.;Initial Catalog=AdventureWorksDW2012;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" Name="Target"></OleDbConnection>
	</Connections>
	<Packages>
		<Package ConstraintMode="Linear" Name="Extract">
			<Connections>
				<Connection ConnectionName="Source"></Connection>
				<Connection ConnectionName="Target"></Connection>
			</Connections>
			<Tasks>
			<!--
			<# 

			foreach (var element in sourceTables)
			{
				GetSqlServerExtendedProperties(element);
			}

			foreach(var table in sourceTables)
			{
				var skip = table.Annotations.FirstOrDefault (a => a.Tag.Equals("ETL_Should_Skip"));

				if (skip != null && skip.Text.Equals("1"))
				{
					//skip the table as it is marked to skip
					continue;
				}

			#>
			-->
			<ExecuteSQL ConnectionName="Target" Name="Truncate Import Table <#=table.Name#>">
				<DirectInput>
					TRUNCATE TABLE Import.<#=table.Name#>
				</DirectInput>
			</ExecuteSQL>
			<Dataflow Name="Copy data into Source <#=table.Name#>">
				<Transformations>
					<OleDbSource ConnectionName="Source" Name="Source - <#=table.Name#>">
						<DirectInput>
							SELECT
							<#=table.GetColumnList()#>
							FROM
							<#=table.Name#>
						</DirectInput>
					</OleDbSource>
					<OleDbDestination ConnectionName="Target" Name="Target - <#=table.Name#>">
						<ExternalTableOutput Table="Import.<#=table.Name#>"></ExternalTableOutput>
					</OleDbDestination>
				</Transformations>
			</Dataflow>
			<!--
			<#
			}
			#>
			-->
			</Tasks>
		</Package>
	</Packages>
</Biml>

<!--
<#+
AstNode GetSqlServerExtendedProperties(AstTableNode table)
{
	AstNode returnValue = (AstNode)table;

	var query = string.Format("SELECT name, value FROM fn_listextendedproperty(NULL, 'schema', '{0}', 'table', '{1}', NULL, NULL)", table.SchemaName, table.Name);

	var extendedProperties = ExternalDataAccess.GetDataTable(conn.ConnectionString, query);

	foreach (DataRow ep in extendedProperties.Rows)
	{
		returnValue.AddAnnotation(AnnotationType.Tag, ep["value"].ToString(), ep["name"].ToString());
	}

	return returnValue;
}
#>
-->

Conclusion

If you find yourself writing lots of C# code to use within BIML then it is definitely worth trying out LinqPad to see if it makes your life easier.

 

Data Mining Resources for SQL Server Developers

Below are a few resources that I’ve discovered recently that show various ways of getting started with the Data Mining capabilities of SQL Server.

Enhancing Applications with SQL Server Data Mining by Peter Myers (TechEd Australia 2013)

In this excellent talk by Peter Myers, he demonstrates how easy it can be to build a data mining model, then shows three different ways that it can be incorporated into an application:

  • Embedding data mining visualizations into a WinForms application
  • Using a data mining model as the source for a Reporting Services report, and embedding that report into a WinForms application
  • Using the mining model to validate the “likeliness” of data input in a WinForms application

In addition to giving a very practical demonstration of creating and using a data mining model, he also covers some of the fundamentals associated with data mining. In particular, he gives an excellent overview of the various models that are available with SQL Server.

Advanced-and Easy!-BI with Excel and SQL Server Data Mining by Rafal Lukawiecki (SharePoint Conference 2014)

In this talk by Rafal Lukawiecki (b | t), he once again gives a general overview of Data Mining, explaining some of the concepts, then shows how to get started using the (free) Data Mining plugin for Excel 2013. He then moves on to building models using SQL Server Data Tools.

Most of the talk is spent within Excel, which really does show it is capable of almost anything 🙂

Forecasting with the Microsoft Time Series Data Mining Algorithm by Peter Myers (TechEd 2014)

In this talk from TechEd, Peter Myers focuses on the time series data mining algorithm in depth. He shows how it can be used to do forecasting using historical data.

The Level 300 (Experienced) classification is definitely appropriate; still, if you want an in-depth example of using the Time Series algorithm this is the video for you.

Data Mining in SQL Server Analysis Services by Brian Knight (SQLPass 2013)

In this excellent entry level presentation Brian Knight (of Pragmatic Works) demonstrates how simple and easy it is to get started working with Data Mining. If you’re looking for an introduction to the art of the possible this is a great place to start.

Kaggle – Data sets

In the talks above (and plenty of others I’ve seen in the past) the data set used for the examples is the well-known AdventureWorksDW; if you’re a bit bored of it and fancy something different, take a look at Kaggle.com.

Kaggle, if you haven’t come across it already, is "…the leading platform for predictive modeling competitions". Various companies publish interesting data sets as part of competitions, which you can enter to win prizes and money, or you can use the data sets for experimentation (check the TOC).

BIML – Parameter Binding with SSIS 2012

For any SSIS project that contains more than a few packages chances are you have a master (or coordinator) package that controls execution of various packages within your project. In this post I’ll show you how to use Package Parameters with BIML.

When I first started searching around on how to do this, I came across this sample on the BIMLScript.com site: http://bimlscript.com/Snippet/Details/73

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
 <Packages>
   <Package Name="PackageParameters" ConstraintMode="Parallel">
     <Parameters>
       <Parameter Name="PackageParamStr" DataType="String">Test</Parameter>
     </Parameters>
     <Variables>
       <Variable Name="TestPackageParam" DataType="String" EvaluateAsExpression="true">@[$Package::PackageParamStr]</Variable>
     </Variables>
   </Package>
 </Packages>
</Biml>

In this sample the value of the parameter is assigned to a package variable, which is then used within the package. Unfortunately I couldn’t get this working so I had to keep looking for another solution. Trying to use the parameter directly proved to be an exercise in experimentation. After a few different attempts it turned out to be very simple; here is an example of using a Package Parameter with an OLE DB Source:

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
 <Packages>
  <Package Name="PackageParameters" ConstraintMode="Parallel">
   <Parameters>
    <Parameter Name="PackageParamStr" DataType="String">Test</Parameter>
   </Parameters>
   <Tasks>
    <Dataflow Name="DFT - Move some data around">
     <Transformations>
      <OleDbSource ConnectionName="Source" Name="SRC - Client">
       <DirectInput>EXEC ETL.GetSomeData @sp_param1 = ?</DirectInput>
       <Parameters>
        <Parameter VariableName="PackageParamStr" Name="@sp_param1" />
       </Parameters>
      </OleDbSource>
     </Transformations>
    </Dataflow>
   </Tasks>
  </Package>
 </Packages>
</Biml>

The Parameter element inside the OleDbSource shows how to map the package parameter onto the @sp_param1 parameter of the stored procedure being called by the OLE DB Source.

Building an SSIS Testing Framework

If you compared the Google results for "testing C#" and "testing SSIS" you would quickly realize that testability isn’t one of the strengths of SSIS. I’ve seen a few different frameworks which were either too complicated or didn’t really work, so in this post I’m going to give you my take on building an SSIS Testing Framework.

The Goal

The overall goals for my test framework:

  • Simplicity – writing tests needs to be as simple and easy as possible
  • The ability to load sample data into database tables
  • The ability to run a "setup" SQL script
  • The ability to run an assert script to determine if the test has passed/failed
  • Auditing/history of previous test runs

My Solution

I didn’t initially set out to build a testing framework; it just evolved into one over the course of the couple of days I was working on it. I started off by manually testing my SSIS packages and realized I was following the same steps over and over again:

  1. Load sample data
  2. Run SSIS Package
  3. Run some scripts to check if it worked

I could see that this was going to get repetitive, so I started to think about how to automate it. After a few iterations this is what I’ve ended up with (at the time of writing):

1-TestController-package

The TestController package follows the well-trodden AAA (assemble, act, assert) approach found among TDD (test-driven development) practitioners in the software development industry.

This package does the following:

  1. Gets a list of tests to execute
  2. Then, in the foreach loop, it does the following for each of the test cases:

Assemble:

    1. A script task gets a set of test parameters from a SQL table and updates the associated variables in the TestController package; at the moment these are only used to control which tables have data loaded
    2. A pre-data load setup SQL script is then executed
    3. Load Sample Data will then load any sample data for this test run
    4. A post data load setup SQL script is then executed

Act:

  1. Execute the package under test (PUT)

Assert:

  1. Run an assert SQL script

 The Downsides

The solution isn’t perfect; here are a few of the problems it has at the moment:

  • Unable to pass package parameters to the package under test in the Act phase
  • Makes assumptions about how you want to do your setup routine
  • Is purely focused on packages that touch the database
  • Writing assertions isn’t as easy as I would like, it involves writing lots of T-SQL and copy/pasting code throughout the different assert scripts
  • All your databases need to be on the same server
  • Maintaining the variables and connection managers is painful

Some of these I’m not terribly bothered about; for example, having all of your databases on a single server is acceptable given this is for testing purposes. The duplication of assertion code, on the other hand, could easily be fixed by creating some reusable stored procedures (e.g. Assert.TableEqual, Assert.TableEmpty etc.) similar to those offered by tSQLt, along the lines of the sketch below.
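As a rough illustration of what one of those helpers could look like (a sketch only, assuming an Assert schema already exists in the test database and ignoring quoting and error-message niceties):

CREATE PROCEDURE Assert.TableEmpty
    @TableName nvarchar(256),
    @TestName  nvarchar(200)
AS
BEGIN
    -- Dynamic SQL because the table name is a parameter; a real version
    -- would validate and quote the name properly.
    DECLARE @sql nvarchar(max) = N'SELECT @cnt = COUNT(*) FROM ' + @TableName + N';';
    DECLARE @cnt int;
    EXEC sp_executesql @sql, N'@cnt int OUTPUT', @cnt = @cnt OUTPUT;

    IF @cnt <> 0
        RAISERROR (N'%s failed: %s is not empty (%d rows).', 16, 1, @TestName, @TableName, @cnt);
END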

One of the more painful areas of the framework is setting up tables to have sample data loaded; the process involves adding variables, setting up connection managers, adding components to the data flow task and updating the script component. One way of fixing this issue would be to generate the TestController package itself with BIML and have all of this created automatically.

Wrap Up

As a starting point I’m really happy with how the TestController is working at the moment. During its development and subsequent use I picked up a few bugs in the packages that I wanted to test, so it is paying for itself already 🙂

If you’re interested in trying it out for yourself then leave a comment and I’ll put it up on GitHub.

Scraping the Web with Power Query

Last year I entered the PowerBI video demo contest. Whilst I didn’t win any prizes I did learn a fair bit from going through the exercise of putting together a screencast demo (more on that another time). In this post I’m going to walk through the web scraping part of my demo.

The website that I chose to use for my demo was the National UFO Reporting Center (more for novelty’s sake than any serious interest). Below you can see the listing by shape of craft (www.nuforc.org/webreports/ndxshape.html).

1-index-by-shape

If I click on one of these links then the details of each individual sighting are listed:

2-sample-details

The overall goal is to have all of the individual sightings retrieved from each individual page and stored into a Power Pivot table.

One thing which makes the exercise of scraping this site easy is that the index page links to each of the individual pages in a consistent manner:

http://www.nuforc.org/webreports/ndxs{SHAPE NAME HERE}.html

This means the process of scraping will follow these steps:

  1. Retrieve the list of shapes from the index page and store these
  2. Loop through each of the shapes from step 1:
    1. Dynamically generate the URI based on the shape name
    2. Retrieve the table of sightings

Retrieving the Shape index table

From Excel select the Power Query tab, then click the From Web button.

2a-from-web

This will bring up a dialog asking for the URL of the webpage; enter the URL for the shape index: http://www.nuforc.org/webreports/ndxshape.html

Doing this used to take you directly into the Power Query editor, but now there is an extra step. On the right-hand side of Excel you will see a Navigator appear:

3-navigator

This shows all the different elements that Power Query discovered on the web page. Table 0 contains the list of all the shapes, which you can see in the preview when you hover your mouse over it. Click Table 0, then click Edit to bring up the Query Editor.

At this point all I did was a few clean up tasks:

  • Removed the Count column
  • Renamed “Reports” to “Shape”
  • Renamed the Query to “Shape List”

I also un-ticked "Load to Worksheet"; you will see why shortly.

4-query-editor-shape-list

Retrieving individual sightings

The next step is to parse the individual sighting pages; I’ll do this in two stages. First I’ll take one of the sighting pages and use it to work out the transformations and clean-up tasks needed; then I’ll show you how to turn this into a function which takes a shape name and uses it to dynamically generate the URL to retrieve. That function will then be used in the next section.

For my example I will be using the Cone shaped sightings (www.nuforc.org/webreports/ndxsCone.html).

Once again use the From Web button and enter the above URL, then in the navigator select Table 0 and then click Edit to open the Query Editor.

I performed the following clean up tasks:

  • Removed “Posted” and “Shape” columns
  • Used the Split Column transform on the “Date / Time” (delimited by Space, at the left most)
  • Renamed the new columns to “Date” and “Time”

The dates are in a format that Power Query is unable to recognize so I had to perform some extra transformation steps to reformat them:

  • Used the Split Column transform on the “Date” column (delimited by custom, /, at each occurrence). This will split out each date part.
  • Changed the data type of these new columns to Text
  • Inserted a custom column with the following formula:

Date.From(Text.Combine({[Date.3],[Date.1],[Date.2]}, "/"))

  • Removed the “Date.1”, “Date.2”, “Date.3” columns
  • Moved the Date column to be in the first position

5-cone-after-cleanup

The next step is to turn this into a reusable formula. Bring up the Advanced Editor from the View tab in the ribbon.

5a-advanced-editor

6-before-function

To turn this into a function replace the first two rows with the following code:

let
fnGetByShape = (shapeName as text) =>
let
Source = Web.Page(Web.Contents(Text.Replace("http://www.nuforc.org/webreports/ndxs{shape}.html", "{shape}", shapeName))),

 

and insert the following at the end

in
fnGetByShape

These two code snippets will turn the query into a function. Click Done to close the Advanced Editor and you will see that the table of data that was there is now replaced with this:

7-now-a-function
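For reference, the assembled function ends up looking roughly like the following; the navigation and clean-up steps in the middle are whatever the Query Editor generated for you earlier, so treat this as a sketch rather than exact code:

let
    fnGetByShape = (shapeName as text) =>
    let
        Source = Web.Page(Web.Contents(Text.Replace("http://www.nuforc.org/webreports/ndxs{shape}.html", "{shape}", shapeName))),
        Data0 = Source{0}[Data],
        // ...the Removed Columns / Split Column / date clean-up steps go here...
        RemovedColumns = Table.RemoveColumns(Data0, {"Posted", "Shape"})
    in
        RemovedColumns
in
    fnGetByShape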

You can test this out by clicking the Invoke button and entering in a shape name (e.g. Cone).

The last thing left to do is give the query a good name (fnGetSightingsByShape) and ensure that both the Load to Worksheet and Load to Data Model options are un-ticked.

Using the function

Now that both the list of shapes query and the function to retrieve each individual sighting are ready, it is time to combine them and load the data into Power Pivot.

Open the Shape List query that was created earlier and add a custom column:

8-inserting-the-function

This will add a new column which calls the custom function. Expanding the Table in the new column will show the columns from the function (Date, Time, City, State, Duration, Summary). I had to set the data types for the Date and Time columns.
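If you look at the Advanced Editor at this point, the combined query looks roughly like this (step names are illustrative and will differ from whatever the UI generated for you):

let
    Source = Web.Page(Web.Contents("http://www.nuforc.org/webreports/ndxshape.html")),
    ShapeTable = Source{0}[Data],
    RemovedCount = Table.RemoveColumns(ShapeTable, {"Count"}),
    RenamedShape = Table.RenameColumns(RemovedCount, {{"Reports", "Shape"}}),
    AddedSightings = Table.AddColumn(RenamedShape, "Sightings", each fnGetSightingsByShape([Shape])),
    ExpandedSightings = Table.ExpandTableColumn(AddedSightings, "Sightings", {"Date", "Time", "City", "State", "Duration", "Summary"})
in
    ExpandedSightings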

Now the only thing left to do is tick "Load to Data Model", then get a cup of tea or coffee while Power Query loads the data into Power Pivot.

In Summary

In this post I’ve shown you how you can use Power Query to scrape data from a website; in addition, I’ve shown you how to build reusable functions which can be used to scrape different levels of pages on a site.

I’ve uploaded my finished demo to Dropbox for your enjoyment.

Debugging Custom SSIS Components

In the past I have written extensively about how to build custom components for SQL Server Integration Services; those posts have always focused on the ‘happy path’. If you’re not familiar with this phrase, it refers to the path through your application that works exactly as expected. Often in development we have to deal with the sad path, when things aren’t working as we would like or expect.

In this post I will show two techniques that can be used to help with debugging custom SSIS components.

Attaching the Visual Studio Debugger

You will need to have Visual Studio open with the source code for the custom component and have SSDT open with the package that uses the custom component.

From Visual Studio select the Debug menu, then Attach to Process.

4-debug-menu

This will bring up the following window showing a list of available processes for debugging:

5-attach-debugger-to

In the list of available processes select the instance of devenv.exe that corresponds to your SSIS solution, then click Attach. Once attached, any breakpoints that you define will be hit, and from there you can debug your component.

This method works just fine when you want to debug the setup and configuration of your package, that is, anything the users of your component will be doing prior to executing the package.

The disadvantage of this method is that once you actually start executing the package the debugger will detach itself. The reason for this is that execution of the package happens in a different process. Continue reading for a better method.

Launching the Debugger from in Code

Another way to automatically start debugging is to add the following line of code:

Debugger.Launch();

Into the particular method that you want to debug or into the constructor.

1-adding-debugger-launch

Here you can see I’ve added the Debugger.Launch() call to the constructor of my HDFS Task component. Now when I open a package in SSDT that uses this component, or drag a new instance of the task into an existing package, the following window will appear:

2-jit-debugger

Here I can select how I want to debug the application. During development of custom components I always have an instance of Visual Studio open with my component solution, and you can see it listed as the third option in the image above. By selecting this option and clicking Yes I will be taken back to that instance of Visual Studio, which will then attach itself to my instance of SSDT and break execution.

3-back-in-visual-studio

At this point I can set any breakpoints where I want to break future execution. From here press F5 to continue. Visual Studio will remain attached to SSDT and any breakpoints you’ve defined will interrupt execution and bring you back here.

The disadvantage of this method is that the “Do you want to debug this application” window will keep appearing at various points during package design unless you attach the debugger.

The advantage of using this method is that once you actually start executing your package you will again be prompted to attach the debugger, which means you will be able to debug the component during package execution.

Make sure you either remove this line of code before shipping the component for actual use or wrap it in #if DEBUG / #endif compiler directives (as sketched below), and remember to build the Release configuration for production use!
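A minimal sketch of that guard might look like this (HdfsTask is a stand-in for your own component class, and the SSIS base class it derives from is omitted here):

public class HdfsTask   // stand-in name for your custom task class
{
	public HdfsTask()
	{
#if DEBUG
		// Only prompt to attach a debugger in Debug builds; a Release build
		// shipped to production skips this entirely.
		System.Diagnostics.Debugger.Launch();
#endif
	}
}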

Role Playing Dimensions – DAX vs MDX (Part 1)

A couple of weeks ago I was doing some work on an internal reporting cube. One of the required measures represents an ‘order backlog’, that is, orders that have been received but haven’t yet been provisioned in our systems.

The Problem

The fact table looks something like this:

1-fact-table

A row will appear in the fact table after the order has been closed, with the provisioned date being set to NULL until it has been provisioned. The provisioned date is updated once the order has progressed in the system.

There are a few measures that I want to report on:

  • Won (£) – very simple SUM of the Amount column
  • Provisioned (£) – the amount that was provisioned in a given month
  • Accumulated Backlog (£) – The total amount, for each month, of orders not yet provisioned

For example, in January 2013:

  • Won (£): 550
  • Provisioned (£): 150
  • Backlog Change (£): 400 (Won – Provisioned)
  • Accumulated Backlog (£): 1,250 (400 + 850 from December)

The problem comes when wanting to show the Accumulated Backlog amount by the ClosedDate.

The Solution – DAX and PowerPivot

Accomplishing the end result with PowerPivot turned out to be very straightforward and in my opinion much easier to implement than the MDX solution.

The first thing I did was load my tables into PowerPivot and create the relationships between them; I also created a Calendar hierarchy in my Date table.

2-powerpivot-model

Take note that there are two relationships between the Sales table and the Date table: the active relationship is on the ClosedDate column in Sales, and the inactive relationship is on ProvisionedDate.

The reason for this is that I want all my reports to be based on the ClosedDate of a sales order.

Calculating Won (£) is very simple; as it is based on the ClosedDate, all we need is a simple SUM:

Won (£) := SUM(Sales[Amount])

Calculating Provisioned (£) involves the CALCULATE and USERELATIONSHIP functions. The formula below tells PowerPivot to calculate the [Won (£)] measure using the relationship between Sales and Date based on ProvisionedDate:

Provisioned (£) :=
CALCULATE (
    [Won (£)],
    USERELATIONSHIP ( 'Date'[Date], Sales[ProvisionedDate] ),
    NOT ( ISBLANK ( Sales[ProvisionedDate] ) )
)

Instead of using [Won (£)] in the above formula I could have very easily used SUM(Sales[Amount]), but I thought that using [Won (£)] reads better, because the provisioned amount is the amount of “Won” orders that had been provisioned.

To calculate Backlog Change (£) is then simply:

Backlog Change (£):=[Won (£)]-[Provisioned (£)]

And finally Accumulated Backlog (£): this formula calculates the running sum of the [Backlog Change (£)] measure over all dates up to the latest date in the current filter context.

Accumulated Backlog (£) :=
CALCULATE (
    [Backlog Change (£)],
    FILTER (
        ALL ( 'Date' ),
        'Date'[Date] <= MAX ( 'Date'[Date] )
    )
)

The resulting pivot table is exactly what I was expecting

3-pivot-table-result

With the basics implemented, I could go on to add some other interesting calculations, for example the age of the oldest unprovisioned order or the average time taken to provision an order; a rough sketch of the first of these is below.
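As a first stab, the age of the oldest unprovisioned order might look something like this (a sketch only, assuming the age is wanted in days relative to today and ignoring how a blank result should be handled when nothing is outstanding):

Oldest Unprovisioned Age (Days) :=
INT (
    TODAY ()
        - CALCULATE (
            MIN ( Sales[ClosedDate] ),  -- earliest close date among unprovisioned orders
            ISBLANK ( Sales[ProvisionedDate] )
        )
)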

In the next post I will show how I implemented the above using SQL Server Analysis Services and MDX.