Wednesday, 9 January 2013

Data DeDuplication using SSIS 2012

Data Depulication using SSIS 2012
The De Duplication componants used in this demo are
  • Sort
  • Lookup
  • FuzzyLookup
  • FuzzyGrouping


Sort :Sort removes duplicated rows based on sort key
LookUp:Removes duplicate rows that are  exact matches agaisnt the reference dataset
FuzzyLookup:Removes duplicates based on approximate mappings with the reference dataset
Fuzzygrouping:Identifies duplicates based on approximate matching against its own Dataset


Further explanation of data cleaning technology by Microsoft research can be found at http://sites.computer.org/debull/A12june/cleaning.pdf

For this demonstration Customers table of Adventureworksdw2012 is used as the reference dataset ,I have took a sample of few customers from the customer table and made few changes to make it a new dataset for deduping with the existing customer table

The existing sample of customer table data for the purpose of this demo looks like below

New Sample dataset which has been created for de duplicating with the reference customer table looks like below

I have Created a ssis package in which de duplication is done for various rules using SSIS Deduplication tasks

The package starts by extracting data from the sample dataset which i have shown above,the deduplication in this package is done step by step as per the rules ,the first deduplication task used is Lookup ,we lookup the email address with the email address of customer table


If there are no matching records ,than we assume its a possible new record and pass the non matching records

The non matching records are passed to sort task for further de duplication,this sort task is used for removing any duplicate new records based on email address
The remaining records with distinct email address are passed on to another sort for further deduplication,now the deduplication is done on firstname,lastname,birthdate
The distinct records with first name ,last name and birthdate based on sort de duplication is passed to fuzzygrouping task to assign a score based on similarity in in the same dataset and for further filtering is done using a conditional split

The conditional split is than filters the duplicate records based on the score ,in this example if i assume that a score of <1 data-blogger-escaped-1="" data-blogger-escaped-a="" data-blogger-escaped-and="" data-blogger-escaped-down="" data-blogger-escaped-duplicate="" data-blogger-escaped-duplicates="" data-blogger-escaped-flow="" data-blogger-escaped-have="" data-blogger-escaped-is="" data-blogger-escaped-more="" data-blogger-escaped-non="" data-blogger-escaped-p="" data-blogger-escaped-pass="" data-blogger-escaped-score="" data-blogger-escaped-than="" data-blogger-escaped-the="" data-blogger-escaped-which="">
The filtered records are than passed to another lookup for furter de duplication based on firstname,lastname,birthdate with the reference dataset
A fuzzy lookup is used to identify possible duplicates based on approximate matching on first name last name and address and assign a score for further filtering
Now the remaining records are filtered using conditional split for a assigned score of more than .80 is considered a duplicate and the remaining records less than .80 score are passed for loading as new customers

The package dataflow is shown below
  • The package as shown in the image above starts the data flow with a total of 39 newly extracted records from a sample dataset,which are passed into the first de duplicate task Lookup, which than compares the newly extracted records with the existing reference dataset( in my case i'm using customer table from Adentureworksdw2012 )based on email address column ,the lookup than identifies the 28 duplicate records and passes the non matching 11 records to the sort
  • The sort is the second de duplicate task which is used to remove duplicate email addresses from the new dataset and removes 1 duplicate email and passes to the next sort for identifieng any duplicate records based on firstname ,lastname and birth date ,it doesnt find any duplicates and passes the records to fuzzygrouping task
  • Fuzzy Grouping task than gives a similarity score on the existing dataset and we use a conditional split to filter ,the conditional split filters one record for a similairty score less than 1 as duplicate and passes the remaining records to lookup
  • Lookup Task matches the dataset with customer table Adventureworksdw2012 on firstname,lastname ,birth date and identifies 3 duplicates and sends the remaining 6 records to fuzzy lookup which does a fuzzy lookup
  • Fuzzy lookup than matches the new dataset with the customer table of Adventureworksdw2012 using first name,last name and address and assigns a score ,we than use a conditional split to filter records with score greater than .80 as duplicates which filters two duplicates and passes the remaining four records as new records

This article demonstrates one of the top ten futures of SSIS 2012 which is presented in a demo at TED by Peter Myers

No comments:

Post a Comment

Note: only a member of this blog may post a comment.

Power BI Report Server

Power BI Report Server  is a new product which can be used to host both Power BI reports and SSRS reports. Currently beta version is ava...