In this blog entry I am going to outline the function of the Column Statistics profile request in the new Data Profiling Task in SQL Server 2008. BOL states about this profile request:
Reports statistics, such as minimum, maximum, average, and standard deviation for numeric columns, and minimum and maximum for datetime columns.
This profile helps you identify problems in your data, such as dates that are not valid. For example, you profile a column of historical dates and discover a maximum date that is in the future.
That's fairly each to understand. When people think of data profiling then I suspect that discovering min/max/mean average/standard deviation statistical measures is something that most commonly comes to mind.
I won't show how to configure the task because it is fairly self-explanatory. Instead, let me jump straight to showing the output when running this task on the [Sales].[SalesOrderDetail].[UnitPriceDiscount] column:
As we can see, the task has shown us that the maximum discount given to purchases is 0.4, or 40% and the mean average discount is 0.0028. If we compare this to the actual data using the following query:
select UnitPriceDiscount, COUNT(*) as NumberOfRecords
group by UnitPriceDiscount
order by 1 asc
then we can see that this information seems accurate.
The summary that the Data Profiling task gives us is definitely useful. We could even track these statistics over time to see if the distribution of the data changes (more on that in a later blog entry).
Taking the idea further, we could run the same analysis on multiple columns:
As you can see I am running a statistical profile on the [UnitPrice] and [UnitPriceDiscount] columns although its slightly annoying that I cannot have the results for both on screen at the same time. I have posted a connect submission asking for that to be changed which you can find here.
This concludes my brief summary of the Column Statistics profile request in the Data Profiling Task that is being introduced with SQL Server Integration Services 2008. For more posts in this series please refer to SSIS: Data Profiling Task: Part 1 - Introduction
Disclaimer: The information in this blog post is correct at the time of publication. It may have changed before SQL Server 2008 is released.