Header-53.jpg

INSIGHTS

Physical Database Partitioning in Large Database Design

Database partitioning is not a new technology, but one that is often overlooked in the design of large databases. Partitioning is simply a way of using separate physical storage locations for a single database object. Before partitioning was implemented by database vendors, this was done by creating separate tables for current and historical data, using views to combine the “partitioned” data, etc.

There are a variety of reasons to partition tables and indexes, but this document will focus on the use of partitions in ODS and Data Warehouse databases.  In these environments, partitioning is primarily used for long-term performance stability and ease of maintenance.

The first step in designing a partitioned table is to choose the partition key.  This is a specific column (or columns) that will be used to separate the physical storage of rows in the table.  For ease of maintenance, a date-based partition key is typically selected. 

In general, there are two types of reports. “Snapshot” based reports rely on the state of the data at a specific point in time.  Typically, this is end of day, end of week, end of month, end of quarter, or end of year, depending on the frequency of report execution. A balance sheet is an example of a “Snapshot” based report.  “Transactional” based reports rely on specific events that occur within a period of time. They are distinguished from “Snapshot” based reports, as they have a begin date and an end date, and report all activity that occurred within the date range.  Again, the range of dates depends on the frequency of execution, (daily, weekly, monthly, etc.).  A bank statement is a good example of a “Transaction” based report, (although the balance column is really a “Snapshot” state of your account immediately after the transaction occurred).

Choosing a date-based partition key simplifies maintenance in that a partition can be easily moved (to 2nd tier storage, for example), or dropped.  Dropping a partition, in particular, is a much easier and faster operation than deleting records from a non-partitioned table that meet specific date-based criteria. This strategy assumes that older data is less frequently reported on, or is completely unnecessary, based on a business-defined retention period.

Long-term performance implications are equally, if not more important, than maintenance considerations when choosing a partition key. If the partition key column (or columns) is not part of the WHERE clause, partition elimination does not occur. Partition elimination is fundamental to performance, as it allows the optimizer to query a specific partition (or partitions) of a table, rather than the entire table. Fortunately, the frequency of a report typically dictates a date range that aligns closely, if not perfectly, with maintenance-based considerations. 

It is very common for reporting applications (Cognos, for example), to prompt the user for a specific date for “Snapshot” based reports.  For “Transaction” based reports, the user is typically prompted for the end date, and a start date calculated from the selected end date to limit to a specific range of dates. This can be very important, depending on the volume of data.  For example, when querying a very large ODS for stock market quote data, it is likely that the partitioning strategy will be daily, (most likely sub-partitioned), and the reporting application will require additional filtering, beyond the day, to perhaps a single symbol.  In a large ODS that contains individual stock quotes, a single day could have billions of quotes across all symbols, likely millions for a single stock symbol. The start date (time), could be limited to an hour or less, based on the end date (time) and expected number of rows. This example of extremely large volumes of data illustrates why it is important to partition indexes as well.

There are several types of indexes that can be created on a partitioned table.  Global (non-partitioned) indexes are generally undesirable on partitioned table. They will continue to grow over time, and the optimizer cannot eliminate table partitions when accessing these indexes. In some RDBMS engines, you can partition an index on a different column (or columns) than the table is partitioned on. While this can result in partition elimination on the index, it does not guarantee any table partitions will be eliminated by the optimizer.

Local partitioned indexes are preferred, because they are partitioned on the same column (or columns) of the underlying table. Partition elimination on such indexes guarantees the same partition elimination on the table. However, there are restrictions on local partitioned indexes.  A unique index can only be local partitioned if it contains the partition key of the table. This where the ODS or Data Warehouse model diverges from its source and it must be determined if the source primary key is unique over all time.  If the source primary key is guaranteed to be unique over time, it may be used in conjunction with an appropriate date (used in the partition key) to define primary key in the reporting system, resulting in a local partitioned index.  

This overview of physical partitioning is intended to be database-agnostic.  The available partitioning types (and sub-partitioning types), limitations, and restrictions are RDBMS-specific.  The following reference links can be used for specific information for three of the largest traditional RDBMS vendors.

References:

Oracle 12c Partitioning

SQL Server 2012 Partitioning

DB2 10.5 Partitioning

 

For Additional Blog Post please check out NVISIA's Enterprise Data Management Page

Topics: Tips, Data Management

Written by Greg Goleash

Greg is a Sr. Data Architect at NVISIA

Leave a Comment