Welcome to EMC Consulting Blogs Sign in | Join | Help

Christian Wade's Blog

Unknown Member and Analysis Services 2005

I once wrote an add-on for the SQL Server Accelerator for Business Intelligence to provide unknown-member functionality.  I was therefore naturally interested when I heard that it comes out-of-the-box with Analysis Services 2005.


The idea with the unknown member is that, for each level of a hierarchy (or for each attribute in Analysis Services 2005), there is an unknown member.  Members at a lower level that cannot be resolved (no matching parent), or fact table rows that cannot be resolved, are mapped to the unknown bucket.


Why would we want to do this?


As we often gather data from various sources, it will sometimes not be fully synchronised.  Let’s say we have a fact table that is dimensioned by 20 dimension tables.  Some fact data happens to come in that is resolved to 19 of the dimensions, but not to one less important dimension.  This might be because the dimension gets data from a different source to the fact table and it has not yet been refreshed.  When browsing the cube by any of the 19 dims, we cannot tell that another dimension is unresolved.  When browsing by the unresolved dim, the data appears under the unknown bucket.  Then, 1 or 2 ETL cycles later, the dimension is refreshed; the facts are fully resolved and relate to the correct member.


Let’s say the fact table has a Sales measure.  If we do not allow for the unknown member, we will get incorrect sales amounts across all 20 dimensions because not all the sales data is accounted for.


The reasons given by Ralph Kimball for using integer surrogate keys also highlight the value of the unknown member.  We cannot always rely on the consistency of database IDs from production.  This can result in unmatched fact-table rows for certain dimensions.  The measures in the cube can then still take account of these rows until the discrepancies are resolved.


Utilising the unknown member results in a more robust ETL.  It can avoid errors that are time consuming to resolve and cause an ongoing management overhead.


How would we implement this?


We would typically have a foreign key relationship in the relational database between a fact table and each dimension table.  On the face of it, it would therefore be impossible to insert fact table rows that cannot be resolved to all referenced dimensions.


Whether we follow Kimball’s design methodology on integer surrogate keys affects how we deal with the unknown member.  I will cover both cases.


Without Integer Surrogate Keys


In Analysis Services 2005, we define the foreign key constraint logically in the DSV.  It is not propagated back to the relational source.  (The relational source could optionally have a disabled foreign key constraint representing this relationship.)  We can therefore insert fact table rows that are not necessarily referenced by the parent dimension.  In this case the unknown member will just ‘work’.


A number of the problems outlined by Kimball regarding not using surrogate keys are overcome by Analysis Services 2005 providing out-of-the-box unknown member functionality.  For example, if dimension keys come in at fact level that are not formatted as expected, they will reside under the unknown bucket until the discrepancy has been resolved.


The biggest value-add resulting from the out-of-the-box Analysis Services 2005 unknown-member functionality applies to OLAP systems that do not use integer surrogate keys.


Dim Properties in Analysis Services 2005


The following screenshot shows a couple of dimension properties in Analysis Services 2005 for the unknown member.


AS 2005 dim properties


With UnknownMember set to Visible, we get the behaviour described above.  Hidden means the unknown member is not visible, but the aggregations still take account of it.  None means the unknown member is not used.


The UnknownMemberName property just allows us to change the name of the unknown member from the default of ‘Unknown’.


With Integer Surrogate Keys


However, I am in the integer surrogate-key ‘camp’.  In this case, we can create a physical unknown member by having a dimension-table row with an ID of -1.  Then, as part of the ETL, any unresolved fact-table rows can be mapped to this special dimension row.


Integer surrogate keys inherently provide a structure that lends itself to the unknown member.


With Analysis Services 2005, we don’t actually need to bother with the physical dimension member (dimension row with ID of -1).  Since foreign key relationships allow nulls in the child table even if no parent row has a null key (null doesn’t equal null anyway), we can have the foreign-key field in the fact table holding a null value for unresolved rows.  The out-of-the-box Analysis Services 2005 functionality will automatically map such rows to the unknown bucket.


Whether we use null in the fact table or a physical dimension member doesn’t really make much difference.  I will show an example that uses a physical dimension member, but the similar logic would be applied in either case.

Unknown member example

We have 2 tables: the product dimension table (DimProd) and the sales fact table (FactSales).  The ID from the transactional system is held in DimProdCode for reference.


In order to allow for the unknown member, we will need to store the product ID from the transactional system in the fact table as well (although we can isolate this to the fact table for the ‘unknown partition’ – see below).  During the current ETL cycle, there is a fact with an ID from the transactional system of ‘JKL’.  As there is no corresponding product for this row, it has a DimProdID value of -1.


At the next ETL cycle, the product with transactional ID of ‘JKL’ comes in and the dim table is refreshed.  The ETL must then update the -1 value with the newly generated value for DimProdID.  After cube processing, there would be nothing in the unknown member.


Partitioning for Unknowns


If we are dealing with large volumes and minimising the cube-processing window is important to us, we can take advantage of measure-group partitions.  Let’s say we want to incrementally process our cubes in order to save time.  Given the dynamic nature of the data in the fact table due to the unknown member, we have to perform full processing unless we partition accordingly.  Let’s use the above (very simple) example to illustrate how we would do this.
 

 
 

Unknown member partitioning

We are splitting up the fact table into 2 tables; one to hold fully resolved facts and the other to hold facts that are unresolved to at least one dimension.  This allows us to incrementally process the main partition for FactSales (which would typically hold the vast majority of the data), and fully process ‘unknown partition’ for FactSales_Unk.  This could save a big chunk of cube processing time.


Note that FactSales no longer requires the DimProdCode column because all rows can be looked up against the DimProd table.


ETL-Cycle Process with Unknown Partitioning


Let’s follow through a possible sequence of events for a group of fact table rows.  Again, our hypothetical fact table is dimensioned by 20 dimensions.

ETL Cycle 1

Facts come in that are unresolved for 3 of the 20 dimensions.  They are therefore placed in the unknown partition.

ETL Cycle 2

The dimension data is refreshed and there is now only a single unresolved dimension for the facts.  Some of the -1 values are updated accordingly, but the data remains in the unknown partition.

ETL Cycle 3

The dimension data is again refreshed and all 20 dimensions are now fully resolved.  From this point, the data becomes much more static, so it is deleted from the unknown partition and incrementally loaded into the main partition.


Conclusion


I think the unknown member is pretty useful.  Setting it up properly for large volumes with incremental processing can result in a little extra development time.  This extra setup is obviously not required if the cubes are fully processed.

The out-of-the-box Analysis Services 2005 functionality for the unknown member is primarily geared to OLAP systems that do not use integer surrogate keys.


The unknown member can result in a more robust ETL.  Not to mention the fact that it can prevent incorrect numbers being displayed to the users (God forbid!).



Comments

 

christian.wade said:

Christian,
I'm of the opinion that an Unknown dimension member should be handled in the ETL. It should not be up to AS to handle it.

What do you think?

-Jamie
April 2, 2005 18:21
 

christian.wade said:

Jamie,

That is a fair comment. Using integer surrogate keys it is effectively handled in the ETL; whether using -1 or null in the fact table.

Christian
April 2, 2005 23:47
 

christian.wade said:

Jamie,

I modified the example slightly to better illustrate that the emphasis is on the ETL when using integer surrogate keys.

Christian
April 4, 2005 11:29
 

TrackBack said:

Link to Unknown Member Blog
April 7, 2005 16:15
 

christian.wade said:

The BI/AS2K project I have been involved with for the last 18 months handles 'unknowns' in the same way as under the heading "With Integer Surrogate Keys". We don't do the "partitioning for unknowns" thing, but only because we never thought of it. It seems like a pretty cool idea to me.
All of our fact tables have both the transaction system key (let's assume there's only one dimension - 'product' with say, 4 levels, Dept., Category, Section, Product) and the surrogate key. In fact, due to the fact we use a partitioned view over several tables (requiring a check constraint on one of the primary key columns, and NO Identity Column) our primary key for the fact table(s) uses the transaction system key, not the surrogate key.
This is the only way we could get it to work in a situation where we get an 'unknown' product (ie where the product does not exist in our dimension table) - as we may get several unknown products that would all have the -1 surrogate key.
I was describing this situation to a colleague (more learned than I) who is at another site. He was shocked to hear we have the transaction system key in our fact tables, and he nearly died when I told him they form the primary key. I explained the set-up to him, and he could see what was happening. He then said that we have done a bad implementation of a good idea (ie unknown = -1). He proceeded to tell me that true unknowns are where you don't even get a transaction system key. As long as you get SOMETHING, then it is at least partially known, and should be handled as such. This can arise from systems where data entry is very 'free-text'.
He said in the ETL we should build our Product dimension as usual, then bring in the fact data, scan the facts to determine what are the 'semi-known' products, put them in a DimProduct_UNK table, THEN finally resolve the surrogate product key in the fact table(s) by referencing both the regular DimProduct table, and DimProduct_UNK. This approach results in NOT having to keep the transaction system key in the fact table, AND allows report users to see "what is in the unknown bucket" (a question we often get asked). If they can see it, then they can fix it themselves, by going to the product set-up screen in the transactional system and putting it into the correct heirarchy. So tomorrow (or the next ETL cycle) this 'unknown' product will be extracted and next time the user opens their sales report, the product will be categorised properly. The only 'unknowns' left will be where no product info was received at all from the transactional system. This approach does of course come with an ETL processing time overhead, but nothing comes for free, does it?
What do you guys think of this approach? Please note that we haven't actually done it here, as we have no time for re-work of this magnitude. In future however, I intend to work this idea in to my designs. Unless of course you can see major flaws...
Thanks,
Matt.
April 12, 2005 00:27
 

christian.wade said:

Matt,

Kimball suggests that the fact table primary key should be the concatenation of the surrogate foreign keys referencing the dimension tables. If understand correctly, you are saying you can't do this because multiple unknown member records might violate entity integrity. If you really cannot use the surrogate foreign keys as the fact table primary key, I would not resort to using the transactional keys for this purpose. For the same reason as why we need an unknown member, the transactional keys too could violate entity integrity.

Regarding allowing the users to see "what is in the unknown bucket", I’m assuming you mean allowing the users can see the unresolved keys whilst browsing rather than just an unknown member at the lowest level. In this case the member names would be the 'free-text' key values rather than the actual product names. This would be inconsistent with the other (resolved) products. An alternative would be to create a relational report to provide info on any unknown member's transactional-system keys. This might be simpler and less work than creating a dimension table row for each unresolved fact table row.

The approach I described is geared to common unknown-processing requirements. There are other approaches depending on specific requirements. For example, Donald Farmer describes the case where fact table keys cannot be resolved due to business logic rather than technical issues: http://sqljunkies.com/WebLog/donald%20farmer/archive/2005/04/03/10148.aspx.

You may have some very specific requirements that necessitate a different approach.

HTH,
Christian

April 12, 2005 10:42
 

christian.wade said:

One thing that helped me deal with this situation is to understand that there are (at least) two kinds of unknowns: The kind that we know that we'll never know, and the kind that we expect to know in the future.

The "never know" should be handled by the -1 key. The "know in the future" is an example of early-arriving facts: http://www.kimballgroup.com/html/designtipsPDF/KimballDT57EarlyArriving.pdf

The idea is that you "invent" the dimension record with the rogue business key, and then apply a type 1 change to it when (if) the real record appears.

-Barclay
April 12, 2005 15:22
 

christian.wade said:

Barclay,

Thank you for the comment and the link. It was an interesting read.

Yes, if we are allowing for slowly-changing dimensions, there are other implications.

If we know for certain that the reason a fact row is unknown in the current ETL cycle is that it will definitely receive its dim data in the future, that is a better way to deal with it.

The question is, how do we (or more precisely, how does our code) make the call that a fact row falls under this category? If we know that certain fact tables can only ever receive unknowns that fall into this category, then great; let's use the Type 1 overwrite as described in the document.

As I mentioned to Matt, the approach I described is intended to be as generic as possible. Other approaches allow for more specific requirements.

Christian

April 12, 2005 17:20
Anonymous comments are disabled
Powered by Community Server (Personal Edition), by Telligent Systems