Tag Archives: SSIS data cleansing

Data Cleansing with Regular Expressions in SSIS

Regular Expressions are a powerful way for you to search for patterns in strings of text. In SSIS, we can leverage Regular Expressions to assist us in cleansing data. I’m going to show you a couple different ways we can accomplish this task.

The first method involved using a Script Component within a Data Flow Task to remove unwanted characters from our data. In this example, I’ve created some junk data that contains a mixture of letter and numbers.

image

I want to eliminate the letters from each value. To accomplish this, I will use a Script Component as a Transformation.

First, drag the Script Component into your Data Flow. Make sure you select the Transformation option and click OK.

Next, open the Script Component Editor and go to the Input Columns tab. Select the field you wish you cleanse and set the Usage Type to ReadWrite.

image

Then go back to the Script tab and click the Edit Script button near the bottom.

image

Add the following code to your Script Component:

using System;
using System.Data;
using System.Text.RegularExpressions;  //<<— Added to allow the use of regular expressions
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
//TestJunk is the field I am cleansing. Make sure you use the correct field name.
       Row.TestJunk = RegExCleanse(Row.TestJunk);
    }

public String RegExCleanse(String field)
{
// This is the RegEx pattern we are looking for
String RegExPattern = “[a-z]”;

Regex objRegExp = new Regex(RegExPattern);
field = objRegExp.Replace(field, String.Empty);
return field;
}
}

Once you’ve pasted the script into the Script Component and changed the name of the field and are using the desired RegEx, click OK a couple times to close the script editor and the Script Component Editor. We are now ready to test our script.

Here are the results of our RegEx data cleanse:

image

All the letters have been removed from our text strings.

But there’s actually an easier way to handle data cleansing with Regular Expressions. It’s called the RegEx Replace Transform and its included in Task Factory developed by Pragmatic Works. So lets take a look at how easy it is to use RegEx to cleanse our data without a bit of scripting.

image

I’ve already added the RegEx Replace Transform to my Data Flow Task and connected it up to my OLE DB Source. Double-click the RegEx Replace Transform to open the editor.

Next, click the drop down arrow next to the field you wish you modify with RegEx. Now we can configure the transform to perform the desired cleansing. For the Action, select “Replace matching regular expression pattern with user defined value.” You can also choose to extract a string of text from your field based on the RegEx patten you specify.

For the Output Action, select “Replace the column data with fixed data.”

Lastly, specify the RegEx patten next to “Search RegX Pattern.” Then click OK. Your configuration should closely resemble mine, seen here.

image

Now when I run my Data Flow Task, all letters are removed from my data. The RegEx Replace Transform from Task Factory is an explosive tool that easily allows you to utilize Regular Expressions to scrub your data without the need to write complicated scripts.

Also, just an FYI. If you’re interested in learning more about RegEx or need some helping writing Regular Expressions, check out http://gskinner.com/RegExr/. This site is an amazing learning tool when it comes to RegEx.