Header-53.jpg

INSIGHTS

Dimensional Modeling Basics (Part 1)

There have been entire books and methodologies dedicated to dimensional modeling.  This is not intended to expand or endorse any particular methodology, but to give a brief overview of dimensional modeling techniques, for those who are not familiar with them. Part 2 (to be published at a later date) will provide additional details on “levels” in dimensions, as well as practical guidance on when and how to use these modeling techniques (as well as when not to use them).

Dimensional modeling has only two types of tables: dimensions and facts.  Facts contain only foreign keys to dimensions and measures, which are values that are aggregated across the combination of dimensional keys that exist in the fact. For example, a sales fact table may contain “measures” such as units sold and sales amount (measured in currency). In its simplest form, dimensional modeling results in what is commonly referred to as a “star schema”, as opposed to various forms of “snowflake schema”.  The simple examples that follow are based on a basic “star schema” design.

Dimensions can be modeled in various forms or “types”, in theory.  The two most common types are “Type 1” static dimensions and “Type 2” slowly changing dimensions.   Changes to “Type 1” static dimensions are treated as corrections, rather than different “versions” of the same record.  These dimensions are updated in place, or new values are added.  “Type 2” slowly changing dimensions, on the other hand, are treated as separate states “or versions” that can change over time.

The most basic and always necessary example of a “Type 1” static dimension is time.  This is the most common dimension in any dimensional model and needs to be very static.  It should contain all values for historical and future dates.  In lieu of populating it with years into the future, a simple maintenance procedure can be written to add values to the time dimension slightly before they will be used. An example of the logical representation of time may be:

TIME_DIM_ID                        int identity

DATE_VALUE                          date

MONTH                                     varchar

QUARTER                                 varchar

YEAR                                           int

 

This assumes that the fact or fact(s) are aggregated by day.  The values are easily generated, but assume the following changes as corrections (to the data and the script that populates the dimension):

 

Original:

TIME_ID                  DATE_VALUE        MONTH                   QUARTER               YEAR

            1     2013-01-01       Jan2013               Q1                              2013

         366   2014-01-01        Jan2014              Q1                              2014

         731   2015-01-01      Jan2015                Q1                              2015

 

Changed:

TIME_ ID                 DATE_VALUE        MONTH                   QUARTER               YEAR

             1                   2014-01-01       January 2013     2013 Q1                2013

         366                  2014-01-01         January 2014     2014 Q1                2014

        371                  2015-01-01         January 2015     2015 Q1                2015

 

This correction may be necessary for proper formatting and sort order in reporting.  It is not necessary to “version” such data, as it has no impact on any fact tables using this dimension.  The keys are unchanged, only dimensional values are corrected to satisfy formatting and sorting requirements. In this example, the TIME_DIM_ID may be an identity column or sequence generated number.  It has no meaning.  However, it might be the integer representation of a date in ISO format, such as 20140101. 

“Type 2” slowly changing dimensions are treated differently when updates occur, because changes must be accurately reflected in the related facts.  There are several ways to represent “Type 2” dimensions.  For example, assume that the product dimension is based on new car sales, and is logically modeled as:

PRODUCT_ID                         int identity

YEAR                                           smallint

MODEL                                      varchar

MAKE                                         varchar

TYPE                                            varchar

START_DATE                           date

END_DATE                               date

 

Changes to attributes in this type of dimension require new keys and versioning (based on begin_date and end_date in this simple example).  When building a new schema, unknown start dates are often represented as existing from the beginning of time.  Unknown end dates are often represented as the end of time.  These values will vary based on the database vendor, but consider the following values as an example of a “Type 2” change, with this state as of 2015-01-01:

 

PRODUCT_ID       YEAR       MODEL                    MAKE                       TYPE                          BEGIN_DATE        END_DATE

                     1            2014      Jetta                        Volkswagon        Sedan                      2013-10-01         9999-12-31

                     2      2015            Jetta                        Volkswagon        Sedan                      2014-10-01         9999-12-31

                     3            2015      Touareg                 Volkswagon        SUV                           2014-10-01         9999-12-31

 

All changes are versioned. Assume that there are changes and additions to this product dimension on 2015-02-01, they would be reflected as below:

 

PRODUCT_ID       YEAR       MODEL                    MAKE                       TYPE                          BEGIN_DATE        END_DATE

                     1            2014      Jetta                        Volkswagon        Sedan                      2013-10-01         2014-01-31

                     2      2015            Jetta                        Volkswagon        Sedan                      2014-10-01         2014-01-31

                     3            2015      Touareg                 Volkswagon        SUV                           2014-10-01         2014-01-31

                     4      2015     Jetta SE       Volkswagon        Sedan                      2015-02-01         9999-12-31

                         5        2014      Touareg V6          Volkswagon        SUV                           2015-02-01         9999-12-31

                         6        2015      Jetta TDI S            Volkswagon        Sedan                      2105-02-01         9999-12-31        

What do these changes represent? For product id 1, it is no longer being sold by the business.  Product id 2 and 3 were not properly classified, so old versions were “closed off” and new versions were created.  In other words, the 2015 model values were not specific enough in the original data, and have been made more specific going forward.  The changes in ID (the primary key) result in the proper representation reference in facts based on this dimension.  All fact rows prior to 2015-02-01 will reference the old versions (id from 1-3), while all fact rows from 2015-02-01 will reference the new versions (id 4-6).

 

For purposes of example, consider a third dimension, also “Type 2”,  based on location within the US, with the following structure:

 

LOCATION_ID                        int identity

LOCATION_NAME               varchar

LOCATION_CITY                   varchar

LOCATION_STATE                varchar

LOCATION_REGION           varchar

 

Assume further that a fact table is created, using only these three dimensions, containing only two facts or “measures”:

 

TIME_ID                                    int                              -- references the time dimension

PRODUCT_ID                         int                              -- references the product dimension

LOCATION_ID                        int                              -- references the location dimension

UNITS_SOLD                          int                              -- a fact

SALES_AMOUNT                  money                    -- a fact

 

Fact tables designed this way assume that the combination of dimensional keys are unique (which can be physically enforced) and that the “Facts” are aggregate numbers, aggregated over the combination of dimensions.  This simple design can answer many questions at a high level:

 

  • What is my best/worst selling product, this day, this month, this year?
  • What location generates the most/least new car revenue, this day, this month, this year?

 

These questions can be asked at any level of detail within the dimensions.  Time was included in the example, but consider the attributes (or levels, to be detailed in later installments) available in the product and location dimensions. The same questions can be addressed at the year/model/specific location level, or at a much higher level, such as vehicle type / region level.  The example data may be hundreds or thousands of rows in each dimension, depending on the size of the business.

 

This type of design is not new and it wasn’t new in the mid-90’s when the terminology around “dimensions” and “facts” was invented.  For further understanding of the terminology and methodology related to this type of modeling, please refer to the following references from the “founding fathers” of dimensional modeling:

 

http://www.kimballgroup.com

http://www.inmoncif.com/

 

For Additional Blog Post please check out NVISIA's Enterprise Data Management Page

Topics: Data Management

Written by Greg Goleash

Greg is a Sr. Data Architect at NVISIA

Leave a Comment