In this blog entry I am going to outline the function of the Column Length Distribution profile request in the new Data Profiling Task in SQL Server 2008. BOL states about this profile request:
| Reports all the distinct lengths of string values in the selected column and the percentage of rows in the table that each length represents. This profile helps you identify problems in your data, such as values that are not valid. For example, you profile a column of United States state codes that should be two characters and discover values longer than two characters. |
I am testing this out on the [HumanResources].[Department].[GroupName] column in the AdventureWorks sample database. Here are the contents of that column:
Here is how I have configured the task:
Note that you have some additional options (highlighted) that let you define the behaviour of this profile request.
After running this task, here is the output:
OK, there's a few things to notice about the output here:
- There are 6 distinct lengths of values in the [HumanResources].[Department].[GroupName] column
- Minimum length and Maximum length are both called out explicitly
- We can order the results by value length by clicking on a column heading
- The number of rows that have a particular length for the value in [GroupName] are shown as a percentage of all the rows in the table.
This concludes my brief summary of the Column Length Distribution profile request in the Data Profiling Task that is being introduced with SQL Server Integration Services 2008. For more posts in this series please refer to SSIS: Data Profiling Task: Part 1 - Introduction
-Jamie
Disclaimer: The information in this blog post is correct at the time of publication. It may have changed before SQL Server 2008 is released.