This post is to demonstrate the step by step process of cleansing data with SSIS DQS Cleansing Component and to describe the Data Quality Services (DQS) of SqlServer2012
As data is Precious for companies the accuracy of data is one of the core requirement for Data Systems .The idea to make the data as accurate as possible is the key goal of Data Quality assurance.
Data Quality Services (DQS) enables to manage the data quality of information systems in an easy and effective way.
Key Points of DQS.
- The PreRequisites for DQS Clent Requires .NetFramework 4.0 and internet explorer 6.0 or later
- DQS can be installed as an external service on a separate server to maintain knowledgebase
- In DQS Knowledge Base is used to build knowledge for data quality,the knowledge base consists of domains which defines the knowledge and consists of various rules and properties ,a domain is created for each field for which we want to maintain data quality
- Values in DQS are case insensitive
- When you create a domain select the right data type , you cant edit it once a domain has been created
SSIS has new component of DQS called DQS Cleansing which helps in performing cleansing with DQS in an automated batch process
Cleansing data with SSIS DQS Cleansing componant Step by step
In this demo i will create a knowledge base in Data quality Services Client with a single domain which essentially validates Date values using a regular expression and i will use that knowledge base in SSIS DQS Cleansing Component.
Creating a Knowledge Base: In order to create a knowledge base open the DQS client and click new knowledge base and select activity domain management as shown below
Creating a Domain.
After you select activity domain management from previous step it takes to Domain Management ,in domain management click create new domain and give a name and data type as shown below
The next step is to create a rule for this domain ,click domain rule in domain management section and select Value Matches the regular expression and provide a valid regular expression, in my case i'm validating Birthday field from Adventureworkssdw2012 Employee table with this regular expression (^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$), which essentially checks whether the date is within the range of 1900-01-01 and 2099-12-31
In the next steps i have created a package and selected AdventureWorks2012DW Employee table as source and connected it to DQS Cleansing Component and configured the DQS component to connect to DQS Knowledge base
The Next step is to map the Birthday column from the source with the Domain employeebday domain which we created in our knowledgebase
When you map the source column to the domain ,DQS Cleansing component creates three columns with Source,output and status alias as shown in above image
In the next step i have used a derived column to add a column based on valid statusIn the next step i have used a conditional split to filter the valid birth dates
The completed package looks like this
As you can see in the package i have a total 297 records for employee table ,out of which 296 records go through the validation sucessfully and one record fails validation as this record with Birthday a value of 1876-10-19 falls out of valid range of date (1900-01-01 and 2099-12-31) with our domain rule
No comments:
Post a Comment
Note: only a member of this blog may post a comment.