Tuesday, 22 January 2013

Handling Inferred Members in SSIS


In a data warehouse there are times when the dimensional data arrives later than the fact (transactional) data. The problem these records create is that, during fact table processing, we cannot get the dimension surrogate keys because the corresponding rows are not yet present in the dimension table. If we simply exclude these transactions during the fact load, our totals will be wrong because some transactions are lost.

The solution to this problem is to insert inferred rows into the dimension table for the transactional records that are not yet present in the dimension. For these inferred rows, insert the natural keys taken from the transactional data into the dimension, set the inferred member indicator column to true, and set the remaining attributes to an unknown value, leaving the metadata columns as they are.

After inserting the inferred members into the dimension as described above, load the fact tables. When you perform the lookup against the dimension's natural key, you will not lose those transactions, because the natural keys were already loaded into the dimension as inferred records before the fact load.
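As a rough illustration of that lookup, the fact load conceptually resolves the surrogate key like the T-SQL sketch below; the table and column names (stg.FactTransactions, dbo.DimEmployee, EmployeeId, EmployeeKey) are hypothetical and not taken from the demo.

    -- Hypothetical names: the fact staging rows carry only the natural key (EmployeeId);
    -- the join retrieves the surrogate key (EmployeeKey) from the dimension.
    SELECT f.TransactionAmount,
           d.EmployeeKey        -- surrogate key; the inferred rows guarantee a match exists
    FROM   stg.FactTransactions AS f
    JOIN   dbo.DimEmployee      AS d
           ON d.EmployeeId = f.EmployeeId;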

At this stage we need to note that, even though the rows are included in the fact table, they are not yet associated with the actual dimensional data. We therefore need ETL logic in the dimension load so that, when the actual row finally arrives from the dimension source, it updates the inferred record with the real attribute values.
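Conceptually, the update that this ETL logic performs looks something like the sketch below (in this demo it is handled by the SCD component rather than hand-written SQL); the names stg.EmployeeSource, dbo.DimEmployee, EmployeeName and InferredMember are assumptions for illustration only.

    -- Hypothetical names: replace the 'Unknown' attributes with real values once
    -- the row arrives from the dimension source, and clear the inferred flag.
    UPDATE d
    SET    d.EmployeeName   = s.EmployeeName,
           d.InferredMember = 0
    FROM   dbo.DimEmployee    AS d
    JOIN   stg.EmployeeSource AS s
           ON s.EmployeeId = d.EmployeeId
    WHERE  d.InferredMember = 1;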

SSIS implementation of inferred members

I'm using only two dimensions, employee and date, and one fact table for this demo.



The  Dimensional source data looks like below



The transactional source data looks like below



As you can observe, the transactional data contains two records with employee ids 16 and 17 which have no corresponding employee id in the dimensional data

I have described the implementation of the SSIS package later in the article, but to start the demo: when you run the package with the source data shown above, the employee dimension looks like below. As you can see, the package inserted two new records into the dimension table for the employee keys found in the transactional data, set the inferred member indicator to true (i.e. 1), and inserted an unknown value into the employee name attribute, which is the only other attribute we have included apart from the natural key and the metadata columns.



In the next run I will insert two more records into the dimensional source which correspond to the previous inferred rows 16 and 17 of the transactional data, as shown below



Now after I run the package, the employee dimension updates the inferred rows with employee ids 16 and 17 with their actual dimensional data and sets their inferred indicator to false, as shown below



The developed SSIS Package for this demo looks like below



The package starts by truncating a staging table which holds the employee id values from the fact source, as these values are needed to insert the inferred members into the dimension. This staging table is loaded before loading the fact data, as highlighted in the image below



After truncating the staging table, we load the dimension. For this dimension I have used the SCD component for updating the inferred members and added a few metadata columns, as shown below



In the next step we load only the empid column into the staging table, which we will use in the following step for inserting inferred records into the dimension. The staging load is shown below



In the next step we use an Execute SQL Task to insert the inferred records, as shown below
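The query behind that Execute SQL Task would look roughly like the sketch below; the staging table name stg.FactEmployeeIds and the dimension object names are assumptions, not the exact objects used in the demo.

    -- Hypothetical names: stub in one inferred row per natural key found in the
    -- fact staging table that is not yet present in the dimension.
    INSERT INTO dbo.DimEmployee (EmployeeId, EmployeeName, InferredMember)
    SELECT DISTINCT s.EmployeeId,
           'Unknown',   -- descriptive attributes default to Unknown
           1            -- inferred member indicator
    FROM   stg.FactEmployeeIds AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   dbo.DimEmployee AS d
                       WHERE  d.EmployeeId = s.EmployeeId);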



And finally we load the fact table.

Monday, 21 January 2013

Implementing Hybrid Slowly Changing Dimension using SSIS

A Hybrid Slowly Changing Dimension is a specific type of SCD which implements both Type 1 and Type 2 in the same dimension.


Business users in certain domains need to analyse the data both by the current version and from a historic perspective in order to gain useful insights. As we know, a Type 1 SCD supports analysing the data by the current version, while a Type 2 SCD enables analysing the complete history of the data.

To be specific, consider the customer data shown below



I will use the customer Jimmy Adams to show a change of occupation in our demo. Currently Jimmy Adams is from Cambridge, he is a customer of our imaginary company, and his current occupation is student, as shown above

Let's imagine that Jimmy Adams starts working and his occupation changes to working, as shown below




Our requirement is to analyse the data by the current status and by the complete employment status history, which is essentially Type 1 SCD and Type 2 SCD respectively, as shown below




I will be implementing this requirement with SSIS 2012. I have created an SSIS package; the control flow of the package looks like below




It contains a Foreach Loop container which loops through the text files in a folder and maps each file's location to a variable. I will use that variable in an expression on my customer source connection string in the data flow, so that the package can extract all the existing .txt files in that folder

The Foreach Loop container collection settings are shown below



The for each loop variable mapping is shown below



After the settings in the control flow, we go to the data flow, create a flat file source, and set the connection string of the flat file adapter using an expression in the Properties window. The expression maps to our Foreach Loop variable and gets the file location, as shown below



To keep it simple I'm using the built-in Slowly Changing Dimension component; I will explain later how to configure it for our scenario.

The data flow starts by extracting the customer files. Then we add a Copy Column transformation to create another column to meet our Type 1 and Type 2 requirement, as we need an extra column to hold the employment status history. Our data flow looks like below



We will configure the columns, including the newly created employment status column called Employment Status History, in the SCD wizard. Double click the SCD component, connect to our destination customer dimension, map the source columns to the customer dimension columns and select the business key, as shown below



We are not mapping the Start Date, End Date and Status columns as they are metadata columns. Our customer dimension looks like below; please ignore the data in the diagram at this stage, I will explain it at the end



After the previous settings in the SCD wizard, click Next and configure the change types. At this stage you need to determine which columns are of which type. In my case, all columns except the Employment Status Previous column are set to changing attributes, which makes them Type 1; the Employment Status Previous column, which we created earlier using the Copy Column transformation, is set to a historical attribute to make it Type 2. The settings are shown in the image below




After the previous setting, click Next in the wizard and make the settings for the changing attributes as shown below. By ticking the check box on this page, we tell the SCD component to update all of the changing attribute (Type 1) columns to the current latest value



Click Next in the wizard and make the settings for our historical attribute. The Employment Status Previous column is the only column we mapped as a historical attribute earlier. We will be using the metadata columns Start Date, End Date and a system time variable to manage the Type 2 changes; I will not go through the details as there are plenty of articles on the net that explain them. The advantage of the Start Date and End Date timestamps is that they allow us to query the employment status at any point in time. The image below shows the settings just described
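As a hedged example of such a point-in-time query, assuming the dimension is named dbo.DimCustomer and the current row carries a NULL End Date (both assumptions, not confirmed by the demo), the status valid on a given day could be retrieved like this:

    -- Hypothetical names: return the row that was valid on @AsOf.
    DECLARE @AsOf date = '2013-01-15';

    SELECT CustomerId, EmploymentStatusPrevious, StartDate, EndDate
    FROM   dbo.DimCustomer
    WHERE  StartDate <= @AsOf
      AND  (EndDate > @AsOf OR EndDate IS NULL);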



Click Next and accept the default setting for the inferred members configuration, as we are not implementing any inferred members, and finally click Next to finish the wizard configuration.

After we run the package for the first time with the source data shown earlier, our dimension is loaded with the customer attributes as shown below. Currently the employment status current and employment status previous values are the same, as we haven't had any changes yet



We will now insert a new record for the customer Jimmy Adams with the new employment status of working, as shown below



Now we run the package again; it inserts a new record and updates the end date of the previous record for Jimmy Adams, as shown below.



As shown above, the current employment status of Jimmy Adams is updated to the current value, which in our case is the new working status, while the previous status column keeps the full history, including both student and working; the expiry of the earlier record is shown in the End Date column

Saturday, 19 January 2013

Implementing Slowly Changing Dimension Type 3 (SCD 3) with SSIS

This article describes Slowly Changing Dimension Type 3, the uses of SCD Type 3 and its implementation using SSIS.


Slowly Changing Dimension 3:

Sometimes in business a customer's regional grouping changes from one region to another over time, and there is a requirement to analyse the complete data by the new region as well as by the old region; SCD Type 3 makes this possible


Imagine a customer who belongs to the West region group and is moved to the Northwest group over time, and we want to analyse all the sales data by the West group as well as by the Northwest group


Type 1 doesn't meet the requirement, as it overwrites all the old values with the new values; in this case all values would be updated from West to Northwest for that customer


Type 2 doesn't meet the requirement either, as it associates all the old values with the West region and only the new values with the Northwest region


The customer source data which I have prepared for this demo contains customer information along with their region information, and it looks as shown below

Fact Source data looks like this:

I have created three separate packages: one for loading the customer dimension, another for loading fact sales, and a master package. The date dimension is a standard date dimension taken from the AdventureWorksDW2012 database.


The customer dimension control flow looks like this

The customer dimension Data flow looks like this

The data flow starts by extracting the data from the customer source. The extracted records are referenced against the customer dimension to check for new customers, and for existing customers we pick up the current region and previous region fields


The column mapping for new customers looks like the image below. As you will notice, we don't map the previous region, as there is no previous region for new customers

In the next step I'm using a conditional split to check for a change of region, as shown below

At this stage I'm collecting the customers whose region has changed into a staging table and updating the dimension using an Execute SQL Task, as shown below
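A minimal sketch of what that update could look like is given below, assuming a staging table stg.ChangedRegions and region columns named CurrentRegion and PreviousRegion (all hypothetical names); the actual query used in the demo is the one in the screenshot.

    -- Hypothetical names: shift the region values one slot back (SCD Type 3).
    -- Both assignments read the pre-update values, so the order is safe.
    UPDATE d
    SET    d.PreviousRegion = d.CurrentRegion,   -- keep the old value
           d.CurrentRegion  = s.NewRegion        -- overwrite with the new value
    FROM   dbo.DimCustomer    AS d
    JOIN   stg.ChangedRegions AS s
           ON s.CustomerId = d.CustomerId;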
As you can see in the query, what I'm doing is updating the current region with the new region and the previous region with the current region. After I run the master package, the customer dimension and fact sales are loaded with the initial region values as shown below, and at this stage we have only current regions


I'm inserting a new customer record and its two associated sales, for which there is a change of region, as shown below


The customer dimension and fact sales after the insertion of the new records are shown below


As you can see, the customer's region has changed from West to Northwest, along with the associated fact records for the customer Tony; now we can analyse the data by both the new and the old region.


One key point about SCD Type 3 is that history is overwritten from the second change onward, as we are holding only one column for previous history. If you need to analyse data across more changes, you need to include more columns and change the ETL logic accordingly

Thursday, 17 January 2013

Data Cleansing with Data Quality Services and the SSIS DQS Cleansing Component

This post demonstrates the step by step process of cleansing data with the SSIS DQS Cleansing Component and describes the Data Quality Services (DQS) feature of SQL Server 2012



As data is precious for companies, the accuracy of data is one of the core requirements for data systems. Making the data as accurate as possible is the key goal of data quality assurance.



Data Quality Services (DQS) enables you to manage the data quality of information systems in an easy and effective way.

Key Points of DQS.
  • The prerequisites for the DQS client are .NET Framework 4.0 and Internet Explorer 6.0 or later
  • DQS can be installed as an external service on a separate server to maintain the knowledge base
  • In DQS, a knowledge base is used to build knowledge for data quality. The knowledge base consists of domains, which define the knowledge and consist of various rules and properties; a domain is created for each field for which we want to maintain data quality
  • Values in DQS are case insensitive
  • When you create a domain, select the right data type; you can't change it once the domain has been created


SSIS has a new DQS component called DQS Cleansing which helps to perform cleansing with DQS in an automated batch process



Cleansing data with the SSIS DQS Cleansing component step by step

In this demo I will create a knowledge base in the Data Quality Services client with a single domain which validates date values using a regular expression, and I will use that knowledge base in the SSIS DQS Cleansing Component.


Creating a Knowledge Base: In order to create a knowledge base, open the DQS client, click New Knowledge Base and select the activity Domain Management, as shown below

Creating a Domain.

After you select the Domain Management activity in the previous step, it takes you to Domain Management. In Domain Management, click Create New Domain and give it a name and data type, as shown below



The next step is to create a rule for this domain. Click Domain Rules in the Domain Management section, select Value Matches the regular expression and provide a valid regular expression. In my case I'm validating the BirthDate field of the AdventureWorksDW2012 Employee table with this regular expression (^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$), which essentially checks whether the date is within the range 1900-01-01 to 2099-12-31
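For comparison only, the range part of that rule can be approximated in T-SQL as the sketch below; this is not how DQS evaluates the rule, and the object names (dbo.DimEmployee, BirthDate) are assumed from AdventureWorksDW2012 rather than taken from the demo.

    -- A rough T-SQL equivalent of the range check only (names assumed).
    SELECT BirthDate,
           CASE WHEN BirthDate BETWEEN '1900-01-01' AND '2099-12-31'
                THEN 'Valid' ELSE 'Invalid' END AS BirthDateStatus
    FROM   dbo.DimEmployee;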



In the next steps I created a package, selected the AdventureWorksDW2012 Employee table as the source, connected it to the DQS Cleansing Component and configured the DQS component to connect to the DQS knowledge base



The next step is to map the BirthDate column from the source to the employeebday domain which we created in our knowledge base



When you map the source column to the domain, the DQS Cleansing component creates three columns with source, output and status aliases, as shown in the above image

In the next step I have used a Derived Column to add a column based on the validation status




In the next step I have used a Conditional Split to filter the valid birth dates



The completed package looks like this



As you can see in the package, I have a total of 297 records in the employee table, out of which 296 records pass validation successfully and one record fails, as its BirthDate value of 1876-10-19 falls outside the valid date range (1900-01-01 to 2099-12-31) of our domain rule

Wednesday, 9 January 2013

Data DeDuplication using SSIS 2012

The de-duplication components used in this demo are
  • Sort
  • Lookup
  • FuzzyLookup
  • FuzzyGrouping


Sort: removes duplicate rows based on the sort key
Lookup: removes duplicate rows that are exact matches against the reference dataset
Fuzzy Lookup: removes duplicates based on approximate matching with the reference dataset
Fuzzy Grouping: identifies duplicates based on approximate matching within its own dataset
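For readers more familiar with SQL than with the SSIS components, the two exact-match steps can be pictured as the rough T-SQL analogues below; the table names stg.NewCustomers and dbo.DimCustomer are hypothetical, and the fuzzy components have no simple T-SQL equivalent.

    -- Sort-style de-duplication: keep one row per email address (names hypothetical).
    WITH Ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY EmailAddress ORDER BY CustomerId) AS rn
        FROM   stg.NewCustomers
    )
    SELECT * FROM Ranked WHERE rn = 1;

    -- Lookup-style de-duplication: keep only rows not already in the reference dataset.
    SELECT n.*
    FROM   stg.NewCustomers AS n
    WHERE  NOT EXISTS (SELECT 1
                       FROM   dbo.DimCustomer AS c
                       WHERE  c.EmailAddress = n.EmailAddress);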


Further explanation of the data cleaning technology from Microsoft Research can be found at http://sites.computer.org/debull/A12june/cleaning.pdf

For this demonstration, the Customers table of AdventureWorksDW2012 is used as the reference dataset. I took a sample of a few customers from the customer table and made a few changes to create a new dataset for de-duplicating against the existing customer table

The existing sample of customer table data for the purpose of this demo looks like below

The new sample dataset which has been created for de-duplicating against the reference customer table looks like below

I have created an SSIS package in which de-duplication is performed against various rules using the SSIS de-duplication tasks

The package starts by extracting data from the sample dataset shown above. De-duplication in this package is done step by step according to the rules. The first de-duplication task used is a Lookup; we look up the email address against the email address in the customer table


If there are no matching records, we assume it is a possible new record and pass the non-matching records on

The non-matching records are passed to a Sort task for further de-duplication; this Sort task is used for removing any duplicate new records based on email address
The remaining records with distinct email addresses are passed on to another Sort for further de-duplication; this time de-duplication is done on first name, last name and birth date
The distinct records by first name, last name and birth date from the Sort de-duplication are passed to a Fuzzy Grouping task, which assigns a score based on similarity within the same dataset, and further filtering is done using a Conditional Split

The Conditional Split then filters the duplicate records based on the score; in this example, records with a similarity score of less than 1 are treated as duplicates, and the non-duplicates are passed further down the data flow
The filtered records are then passed to another Lookup for further de-duplication based on first name, last name and birth date against the reference dataset
A Fuzzy Lookup is then used to identify possible duplicates based on approximate matching on first name, last name and address, and to assign a score for further filtering
The remaining records are then filtered using a Conditional Split: an assigned score of more than .80 is considered a duplicate, and the records with a score of less than .80 are passed through to be loaded as new customers

The package dataflow is shown below
  • The package, as shown in the image above, starts the data flow with a total of 39 newly extracted records from the sample dataset. These are passed into the first de-duplication task, the Lookup, which compares the newly extracted records with the existing reference dataset (in my case the Customer table from AdventureWorksDW2012) based on the email address column; the Lookup identifies 28 duplicate records and passes the 11 non-matching records to the Sort
  • The Sort is the second de-duplication task, used to remove duplicate email addresses from the new dataset; it removes 1 duplicate email and passes the rest to the next Sort, which looks for duplicate records based on first name, last name and birth date. It doesn't find any duplicates and passes the records to the Fuzzy Grouping task
  • The Fuzzy Grouping task then assigns a similarity score within the dataset and we use a Conditional Split to filter; the Conditional Split filters out one record with a similarity score of less than 1 as a duplicate and passes the remaining records to the Lookup
  • The Lookup task matches the dataset with the Customer table of AdventureWorksDW2012 on first name, last name and birth date, identifies 3 duplicates and sends the remaining 6 records to the Fuzzy Lookup
  • The Fuzzy Lookup then matches the new dataset with the Customer table of AdventureWorksDW2012 using first name, last name and address and assigns a score; we then use a Conditional Split to treat records with a score greater than .80 as duplicates, which filters out two duplicates and passes the remaining four records through as new records

This article demonstrates one of the top ten features of SSIS 2012 which was presented in a demo at TED by Peter Myers
