Who are you calling a “junk” dimension?

nvisia is an award-winning software development partner driving competitive edge for clients.


Many times in data warehouse designs, one encounters a bunch of low cardinality (think less than 4 values) attributes (e.g. transactional codes, flags, or text attributes) that are unrelated to any particular dimension. There are a few options to deal with these: 

    1. Add them to the Fact (very inefficient for storage and performance [if it causes data size to cross pages]) 
    2. Create dimensions for each one (clutters up the model & offers no performance advantage) 
    3. Create a “junk” dimension to hold these odd assortment of unrelated attributes (keeps the model clean, reduces storage, and provides better performance) 


We’ll explore the third option a bit further. So we add all the permutations (ie Cartesian product) of all the junk attributes to the junk dimension.  It is worth noting that these attribute values are fairly static (don't change very often).  Some examples include 'status', yes/no flags, types and categories.  In addition to keeping the model clean, adding new low-cardinality attributes becomes much easier. 

Consider the following example:

Order_fact
==========
order_id
date_submitted_key
date_fulfilled_key
date_delivered_key
late_delivery_ind_key
partial_fulfillment_ind_key
customer_key
priority_delivery_ind_key
customer_loyalty_catg_key [gold, silver, bronze, none]

... where you have 4 dimensions (late_delivery, partial_fulfillment, priority_delivery, customer_loyalty_catg) with very low cardinality data values

I propose a better approach...

Order_fact
==========
order_id
date_submitted_key
date_fulfilled_key
date_delivered_key
customer_key
junk_key

Junk_dim
==============
junk_key 
late_delivery_ind                         [yes/no] - 2 values
partial_fulfillment_ind         [yes/no] - 2 values
priority_delivery_ind                 [yes/no] - 2 values
customer_loyalty_catg                 [gold, silver, bronze, none] - 4 values

The Cartesian product of values is 32 (2 * 2 * 2 * 4) for the junk dimension.

The thinking behind performance gains of a junk dimension relate to the inefficiency of low cardinality dimension joins and the number of low cardinality data sets. First, low cardinality attributes are not properly supported by a normal B tree index. If Bitmap indexes are available in your RDBMS implementation, I would highly suggest you use one here. Low cardinality dimension joins are treated like nested loop joins, which, when combined with other nested loop joins are a performance disaster. Instead a Cartesian product join, utilized by junk dimension joins are far more efficient. 


References:
Definition of a junk dimension (
http://en.wikipedia.org/wiki/Dimension_(data_warehouse)#Junk_dimension)

More detailed explanation of Cartesian product joins  (
https://analyticsreckoner.wordpress.com/2012/07/24/modelling-tip-how-junk-dimensions-helps-in-dw-performance/)

For Additional Blog Post please check out NVISIA's Enterprise Data Management Page 

Related Articles