Welcome to EMC Consulting Blogs Sign in | Join | Help

SSIS Junkie

SSIS: Data Profiling Task: Part 5 - Column Statistics

In this blog entry I am going to outline the function of the Column Statistics profile request in the new Data Profiling Task in SQL Server 2008. BOL states about this profile request:

Reports statistics, such as minimum, maximum, average, and standard deviation for numeric columns, and minimum and maximum for datetime columns.

This profile helps you identify problems in your data, such as dates that are not valid. For example, you profile a column of historical dates and discover a maximum date that is in the future.

That's fairly each to understand. When people think of data profiling then I suspect that discovering min/max/mean average/standard deviation statistical measures is something that most commonly comes to mind.

I won't show how to configure the task because it is fairly self-explanatory. Instead, let me jump straight to showing the output when running this task on the [Sales].[SalesOrderDetail].[UnitPriceDiscount] column:

image

As we can see, the task has shown us that the maximum discount given to purchases is 0.4, or 40% and the mean average discount is 0.0028. If we compare this to the actual data using the following query:

select    UnitPriceDiscount, COUNT(*) as NumberOfRecords
from    Sales.SalesOrderDetail
group    by UnitPriceDiscount
order    by 1 asc

then we can see that this information seems accurate.

image

The summary that the Data Profiling task gives us is definitely useful. We could even track these statistics over time to see if the distribution of the data changes (more on that in a later blog entry).

Taking the idea further, we could run the same analysis on multiple columns:

image

As you can see I am running a statistical profile on the [UnitPrice] and [UnitPriceDiscount] columns although its slightly annoying that I cannot have the results for both on screen at the same time. I have posted a connect submission asking for that to be changed which you can find here.

This concludes my brief summary of the Column Statistics profile request in the Data Profiling Task that is being introduced with SQL Server Integration Services 2008. For more posts in this series please refer to SSIS: Data Profiling Task: Part 1 - Introduction

-Jamie

 

Disclaimer: The information in this blog post is correct at the time of publication. It may have changed before SQL Server 2008 is released.

Published Sunday, March 02, 2008 10:49 PM by jamie.thomson

Comments

 

oriental medicine blog » Blog Archive » SSIS: Data Profiling Task: Part 5 - Column Statistics said:

March 2, 2008 11:50 PM
 

SSIS Junkie said:

In this blog entry I am going to outline the function of the Column Value Distribution profile request

March 3, 2008 2:19 AM
 

SSIS Junkie : SSIS: Data Profiling Task: Part 1 - Introduction said:

March 3, 2008 2:21 AM
 

useful dates said:

March 4, 2008 11:45 PM
 

Douglas Laudenschlager said:

Microsoft Research contributed the Data Profiling Task and Data Profile Viewer to Integration Services

January 20, 2009 11:46 PM
New Comments to this post are disabled

This Blog

Syndication

Powered by Community Server (Personal Edition), by Telligent Systems