- Tags
- Article
- Data Management
Dimensional Modeling Basics (Part 1)
There have been entire books and methodologies dedicated to dimensional modeling. This is not intended to expand or endorse any particular methodology, but to give a brief overview of dimensional modeling techniques, for those who are not familiar with them. Part 2 (to be published at a later date) will provide additional details on “levels” in dimensions, as well as practical guidance on when and how to use these modeling techniques (as well as when not to use them).
Dimensional modeling has only two types of tables: dimensions and facts. Facts contain only foreign keys to dimensions and measures, which are values that are aggregated across the combination of dimensional keys that exist in the fact. For example, a sales fact table may contain “measures” such as units sold and sales amount (measured in currency). In its simplest form, dimensional modeling results in what is commonly referred to as a “star schema”, as opposed to various forms of “snowflake schema”. The simple examples that follow are based on a basic “star schema” design.
Dimensions can be modeled in various forms or “types”, in theory. The two most common types are “Type 1” static dimensions and “Type 2” slowly changing dimensions. Changes to “Type 1” static dimensions are treated as corrections, rather than different “versions” of the same record. These dimensions are updated in place, or new values are added. “Type 2” slowly changing dimensions, on the other hand, are treated as separate states “or versions” that can change over time.
The most basic and always necessary example of a “Type 1” static dimension is time. This is the most common dimension in any dimensional model and needs to be very static. It should contain all values for historical and future dates. In lieu of populating it with years into the future, a simple maintenance procedure can be written to add values to the time dimension slightly before they will be used. An example of the logical representation of time may be:
TIME_DIM_ID int identity
DATE_VALUE date
MONTH varchar
QUARTER varchar
YEAR int
This assumes that the fact or fact(s) are aggregated by day. The values are easily generated, but assume the following changes as corrections (to the data and the script that populates the dimension):
Original:
TIME_ID DATE_VALUE MONTH QUARTER YEAR
1 2013-01-01 Jan2013 Q1 2013
…
366 2014-01-01 Jan2014 Q1 2014
…
731 2015-01-01 Jan2015 Q1 2015
Changed:
TIME_ ID DATE_VALUE MONTH QUARTER YEAR
1 2014-01-01 January 2013 2013 Q1 2013
…
366 2014-01-01 January 2014 2014 Q1 2014
…
371 2015-01-01 January 2015 2015 Q1 2015
This correction may be necessary for proper formatting and sort order in reporting. It is not necessary to “version” such data, as it has no impact on any fact tables using this dimension. The keys are unchanged, only dimensional values are corrected to satisfy formatting and sorting requirements. In this example, the TIME_DIM_ID may be an identity column or sequence generated number. It has no meaning. However, it might be the integer representation of a date in ISO format, such as 20140101.
“Type 2” slowly changing dimensions are treated differently when updates occur, because changes must be accurately reflected in the related facts. There are several ways to represent “Type 2” dimensions. For example, assume that the product dimension is based on new car sales, and is logically modeled as:
PRODUCT_ID int identity
YEAR smallint
MODEL varchar
MAKE varchar
TYPE varchar
START_DATE date
END_DATE date
Changes to attributes in this type of dimension require new keys and versioning (based on begin_date and end_date in this simple example). When building a new schema, unknown start dates are often represented as existing from the beginning of time. Unknown end dates are often represented as the end of time. These values will vary based on the database vendor, but consider the following values as an example of a “Type 2” change, with this state as of 2015-01-01:
PRODUCT_ID YEAR MODEL MAKE TYPE BEGIN_DATE END_DATE
1 2014 Jetta Volkswagon Sedan 2013-10-01 9999-12-31
2 2015 Jetta Volkswagon Sedan 2014-10-01 9999-12-31
3 2015 Touareg Volkswagon SUV 2014-10-01 9999-12-31
All changes are versioned. Assume that there are changes and additions to this product dimension on 2015-02-01, they would be reflected as below:
PRODUCT_ID YEAR MODEL MAKE TYPE BEGIN_DATE END_DATE
1 2014 Jetta Volkswagon Sedan 2013-10-01 2014-01-31
2 2015 Jetta Volkswagon Sedan 2014-10-01 2014-01-31
3 2015 Touareg Volkswagon SUV 2014-10-01 2014-01-31
4 2015 Jetta SE Volkswagon Sedan 2015-02-01 9999-12-31
5 2014 Touareg V6 Volkswagon SUV 2015-02-01 9999-12-31
6 2015 Jetta TDI S Volkswagon Sedan 2105-02-01 9999-12-31
What do these changes represent? For product id 1, it is no longer being sold by the business. Product id 2 and 3 were not properly classified, so old versions were “closed off” and new versions were created. In other words, the 2015 model values were not specific enough in the original data, and have been made more specific going forward. The changes in ID (the primary key) result in the proper representation reference in facts based on this dimension. All fact rows prior to 2015-02-01 will reference the old versions (id from 1-3), while all fact rows from 2015-02-01 will reference the new versions (id 4-6).
For purposes of example, consider a third dimension, also “Type 2”, based on location within the US, with the following structure:
LOCATION_ID int identity
LOCATION_NAME varchar
LOCATION_CITY varchar
LOCATION_STATE varchar
LOCATION_REGION varchar
Assume further that a fact table is created, using only these three dimensions, containing only two facts or “measures”:
TIME_ID int -- references the time dimension
PRODUCT_ID int -- references the product dimension
LOCATION_ID int -- references the location dimension
UNITS_SOLD int -- a fact
SALES_AMOUNT money -- a fact
Fact tables designed this way assume that the combination of dimensional keys are unique (which can be physically enforced) and that the “Facts” are aggregate numbers, aggregated over the combination of dimensions. This simple design can answer many questions at a high level:
- What is my best/worst selling product, this day, this month, this year?
- What location generates the most/least new car revenue, this day, this month, this year?
These questions can be asked at any level of detail within the dimensions. Time was included in the example, but consider the attributes (or levels, to be detailed in later installments) available in the product and location dimensions. The same questions can be addressed at the year/model/specific location level, or at a much higher level, such as vehicle type / region level. The example data may be hundreds or thousands of rows in each dimension, depending on the size of the business.
This type of design is not new and it wasn’t new in the mid-90’s when the terminology around “dimensions” and “facts” was invented. For further understanding of the terminology and methodology related to this type of modeling, please refer to the following references from the “founding fathers” of dimensional modeling:
For Additional Blog Post please check out NVISIA's Enterprise Data Management Page