In-memory NoSQL?

on 5/18/16 1:25 PM By | Mike Vogt | 0 Comments | Data Management
Great post explaining what in-memory NoSQL stores are: http://hazelcast.org/use-cases/in-memory-nosql/. Although this information is published by Hazelcast, it gives a good general picture of how in-memory data grids (IMDGs) work.
Read More

[DAMA Chicago] Ensuring your data lake doesn’t become a data swamp

on 2/18/16 5:02 PM By | Mike Vogt | 1 Comment | Big Data data lake Data Management
Here is my presentation from the 2/17/2016 DAMA Chicago meeting. 
Read More

DAMA Chicago Meeting—February 17th, 2016

on 2/16/16 7:56 PM By | Mike Vogt | 0 Comments | Data Quality Big Data Data Management
MEETING AGENDA
8:30 a.m. Continental Breakfast sponsored by NVISIA
9:00 a.m. DAMA Chicago Business Meeting
9:30 a.m. Dr. Sanjay Shirude, Transforming an Application-based IT Organization into a Data-driven Service Provider
11:30 a.m. Lunch
1:00 p.m. Michael Vogt, Ensuring Your Data Lake Doesn't Become a Data Swamp
2:30 p.m. NVISIA presentation
3:00 p.m. Raffle drawing & Adjournment
Location: Nielsen
Read More

DAMA Chicago Meeting—December 9th, 2015

on 11/24/15 2:45 PM By | Mike Vogt | 0 Comments | Data Quality Big Data Data Management
MEETING AGENDA
9:00 a.m. Business Meeting & Announcements
9:30 a.m. Scott Rudenstein (WANdisco), "Data Movement for Globally Deployed Big Data Hadoop Architectures"
11:15 a.m. Shaun Malott (Northern Trust), "Information Quality Analysis and Remediation: A Case Study"
12:15 p.m. Lunch
1:45 p.m. Michael G. Miller, "The Future of Data Governance"
2:45 p.m. Raffle Drawing
3:00 p.m. Adjournment
Location: NORTHERN TRUST, 50 South LaSalle Street, Chicago, IL 60603, Global Conference Center, Miami Room

Morning Presentations

Data Movement for Globally Deployed Big Data Hadoop Architectures
Speaker: Scott Rudenstein
Over the past few years, Hadoop has quickly moved to the production data center for storing and processing Big Data, and is now widely used to support mission-critical applications. One of the challenges for an organization is the adoption of multi-data-center Hadoop, where data needs to flow between environments that are in the same metropolitan area or thousands of miles apart. The problems related to operating Hadoop across the WAN can be broadly divided into data relevancy, continuous availability, risk reduction, and recovery. The challenge we'll focus on is keeping data flowing consistently in the face of network, hardware, and human failures. Eliminating downtime and data loss is critical for any application with stringent service-level agreements (SLAs) and regulatory compliance mandates, which demand the lowest possible recovery point objective (RPO) and recovery time objective (RTO).
Bio: Scott Rudenstein has worked in commercial software sales for 18 years and has an extensive background in Application Lifecycle Management and High Performance Computing. Throughout his career in the US and UK, he has specialized in replication, where data and environments need high availability, disaster recovery, and backup capabilities.
As WANdisco's VP of Technical Services, Scott works with partners, prospects, and customers to help them understand and evolve the requirements for mission-critical, enterprise-ready Hadoop.

Northern Trust: Information Quality Analysis and Remediation: A Case Study
Speaker: Shaun Malott
In this session, Shaun Malott will discuss the strategy and process used to identify and remediate a cross-system information quality issue that was originally selected as low-hanging fruit to demonstrate the power of Northern Trust's information quality tools. This quickly went awry as it unearthed a trove of issues, providing valuable lessons learned while leveraging Embarcadero ER/Studio and IBM InfoSphere Information Analyzer.
Bio: Shaun Malott is a Vice President at The Northern Trust Company, Chicago. He serves as a Business Data Architect and a Data Steward for Wealth Management. He is responsible for the data foundation stream of the Partner Platform program, including future-state data requirements definition and working with technology to define and deliver the data architecture required to meet strategic partner, client, and investment platform data quality requirements. Shaun actively participates in DAMA Chicago, DAMA International, and TDWI. He is a member of the Embarcadero ER/Studio Product Advisory Committee.
Read More

Data Lake or Data Swamp?

on 7/28/15 3:52 PM By | Nate Feldmann | 0 Comments | Big Data Data Management
Read More

Data Virtualization tools

on 6/22/15 2:30 PM By | Mike Vogt | 0 Comments | Data Management
Are they right for you?
Read More

Dimensional Modeling Basics (Part 2)

on 6/12/15 1:56 PM By | Greg Goleash | 0 Comments | Data Management
Read More

2015 Data Landscape

on 6/3/15 11:25 AM By | Mike Vogt | 0 Comments | Data Management
Read More

Dimensional Modeling Basics (Part 1)

on 5/28/15 4:27 PM By | Greg Goleash | 0 Comments | Data Management
There have been entire books and methodologies dedicated to dimensional modeling.  This is not intended to expand or endorse any particular methodology, but to give a brief overview of dimensional modeling techniques, for those who are not familiar with them. Part 2 (to be published at a later date) will provide additional details on “levels” in dimensions, as well as practical guidance on when and how to use these modeling techniques (as well as when not to use them).
Read More

Cassandra Day 2015

on 4/30/15 11:26 AM By | Art Ferrera | 0 Comments | Architecture & Design Tips Open Source Data Management
Read More

Apache Ignite Coding Examples Webinar by Dmitriy Setrakyan

on 4/15/15 1:59 PM By | Mike Vogt | 0 Comments | Data Management
Read More

Database Security by Design

on 3/6/15 4:43 PM By | Greg Goleash | 0 Comments | Data Management
Read More

If Everyone Owns Data, No One Owns It

on 2/25/15 3:59 PM By | Mike Vogt | 0 Comments | Data Management
When I start with a new client, I almost always ask, "Who owns the data?" The answers range from corporate ownership to IT, Data Management, or Sales ownership. When I clarify the question as who is accountable (i.e., whose head is on a stick) when there is a data-related disaster, like a financial misstatement, the sound of people retreating from ownership is deafening -- until only one person is left. In many organizations, Sales owns the 'core' customer, with Finance owning some specific financial customer attributes.
Read More

Is VPD a disease or a cure?

on 2/17/15 7:22 AM By | Greg Zambelli | 0 Comments | Enterprise Data Management Data Management
 During a recent project, the requirements included data security and access by users in different regional locations. Most users were able to access the data using the BI tool (OBIEE) for financial reports.  Another user had access using SQL Developer and could view all of the data in the system.  The worst scenario of them all was a user accessing the data with a VB script in Excel. :-)
Read More

Just Give Me the Factless Facts, Ma'am

on 2/4/15 2:03 PM By | Art Ferrera | 0 Comments | Data Management
We all know that in the world of dimensional modeling, the central table of a star schema is the fact table. The fact table contains both keys to join with dimensions and business measures. There are times, however, when measures are not needed and the fact contains only the pertinent keys. This is called a factless fact table. You are probably thinking "oxymoron" here, but in reality they do have their advantages. There are two types of factless facts: events and coverage.

The first type of factless fact table records an event. Many event-tracking tables in dimensional data warehouses turn out to be factless: no measures are associated with an important business process. Events or activities occur that you wish to track, but you find no measurements. In situations like this, build a standard transaction-grained fact table that contains no facts. Diagram 1 below depicts an example of an event-tracking factless fact. This model shows award nominations for all historical entertainment awards such as the SAGs, the Emmys, and of course the Oscars. There are no business measures supporting this model, just the tracking of nominations. The dimensions are the actors/actresses, movies/shows, the category of the nomination, the ceremony type and date, as well as the winning award if they have won. This model can answer a lot of interesting questions, such as: What movie had the most nominations at the '65 Oscars? How many times was Jack Nicholson nominated? How many awards did Breaking Bad win last year? The SQL for this model would simply be a SELECT COUNT(*). Alternatively, you can add an INT field to the fact that always has a value of 1; in this case, a NOMINATIONS field would be added at the end of the fact and your SQL would be a SELECT SUM(NOMINATIONS).

In another case, there may not be clear events or transactions, but you want to support negative analysis. This is where a "coverage fact table" comes in handy. Coined by Ralph Kimball, coverage fact tables are used to model conditions or other important relationships among dimensions. Take a look at Diagram 2. This model depicts car sales at a local car dealership. Based on the model, you can come up with the top sales reps for any given time period. The model, however, does not tell you who the biggest slacker is based on sales effectiveness (customers sold / customers assigned), or whether assignments are distributed evenly among the sales force. Diagram 3 depicts the Customer SalesRep Assignment. Notice the sales amount measure is pulled out of the fact table, and the grain of the data is finer at the assignment level as opposed to actual sales. The difference between the two fact tables' result sets would then answer the negative-analysis questions.

Diagram 2:

Diagram 3:

In closing, we all want the facts... even if they are factless! http://www.kimballgroup.com/1996/09/factless-fact-tables/
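The COUNT(*) versus SUM(constant-1) choice above can be sketched in a few lines of SQL. This is a minimal illustration using SQLite; the table and column names (nomination_fact, actor_key, etc.) are hypothetical stand-ins for the diagrams, not taken from an actual schema.

```python
import sqlite3

# An event-style factless fact table: only dimension keys,
# plus an optional constant-1 column for SUM-friendly queries.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE nomination_fact (
        actor_key     INTEGER,
        movie_key     INTEGER,
        category_key  INTEGER,
        ceremony_key  INTEGER,
        nominations   INTEGER DEFAULT 1  -- always 1; enables SUM()
    )
""")
rows = [(1, 10, 100, 2015, 1),
        (1, 11, 101, 2015, 1),
        (2, 10, 100, 2015, 1)]
cur.executemany("INSERT INTO nomination_fact VALUES (?, ?, ?, ?, ?)", rows)

# "How many times was actor 1 nominated?" -- just count rows...
count = cur.execute(
    "SELECT COUNT(*) FROM nomination_fact WHERE actor_key = 1").fetchone()[0]

# ...or sum the constant-1 column, which reads more naturally in BI tools.
total = cur.execute(
    "SELECT SUM(nominations) FROM nomination_fact WHERE actor_key = 1").fetchone()[0]

print(count, total)  # both 2
```

Either query returns the same answer; the constant-1 column just lets reporting tools treat the fact like any other additive measure.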
Read More

Who are you calling a “junk” dimension?

on 1/28/15 11:24 AM By | Mike Vogt | 0 Comments | Data Management
Many times in data warehouse designs, one encounters a bunch of low-cardinality (think fewer than 4 values) attributes (e.g., transactional codes, flags, or text attributes) that are unrelated to any particular dimension. There are a few options for dealing with these:

1. Add them to the fact (very inefficient for storage and performance, if it causes the data size to cross pages)
2. Create a dimension for each one (clutters up the model and offers no performance advantage)
3. Create a "junk" dimension to hold this odd assortment of unrelated attributes (keeps the model clean, reduces storage, and provides better performance)

We'll explore the third option a bit further. We add all the permutations (i.e., the Cartesian product) of the junk attributes to the junk dimension. It is worth noting that these attribute values are fairly static (they don't change very often); some examples include statuses, yes/no flags, types, and categories. In addition to keeping the model clean, adding new low-cardinality attributes becomes much easier. Consider the following example:

Order_fact
==========
order_id
date_submitted_key
date_fulfilled_key
date_delivered_key
late_delivery_ind_key
partial_fulfillment_ind_key
customer_key
priority_delivery_ind_key
customer_loyalty_catg_key    [gold, silver, bronze, none]
...

where you have 4 dimensions (late_delivery, partial_fulfillment, priority_delivery, customer_loyalty_catg) with very low-cardinality data values. I propose a better approach:

Order_fact
==========
order_id
date_submitted_key
date_fulfilled_key
date_delivered_key
customer_key
junk_key

Junk_dim
========
junk_key
late_delivery_ind            [yes/no] - 2 values
partial_fulfillment_ind      [yes/no] - 2 values
priority_delivery_ind        [yes/no] - 2 values
customer_loyalty_catg        [gold, silver, bronze, none] - 4 values

The Cartesian product of values for the junk dimension is 32 (2 * 2 * 2 * 4).

The performance gains of a junk dimension relate to the inefficiency of low-cardinality dimension joins and the number of low-cardinality data sets. First, low-cardinality attributes are not well served by a normal B-tree index; if bitmap indexes are available in your RDBMS, I would highly suggest using one here. Low-cardinality dimension joins are treated like nested-loop joins, which, when combined with other nested-loop joins, are a performance disaster. Instead, a single join to the pre-computed Cartesian-product junk dimension is far more efficient.

References:
Definition of a junk dimension (http://en.wikipedia.org/wiki/Dimension_(data_warehouse)#Junk_dimension)
More detailed explanation of Cartesian product joins (https://analyticsreckoner.wordpress.com/2012/07/24/modelling-tip-how-junk-dimensions-helps-in-dw-performance/)

For additional blog posts, please check out NVISIA's Enterprise Data Management page.
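Pre-populating the junk dimension with every permutation is a one-liner with a Cartesian product. Here is a minimal sketch in Python, using the four attributes from the example above (the junk_key assignment is illustrative; in practice it would be a database-generated surrogate key):

```python
from itertools import product

# The four low-cardinality junk attributes from the Order_fact example.
late_delivery = ["yes", "no"]
partial_fulfillment = ["yes", "no"]
priority_delivery = ["yes", "no"]
customer_loyalty = ["gold", "silver", "bronze", "none"]

# One row per permutation; the fact table then stores a single junk_key.
junk_dim = [
    {"junk_key": i,
     "late_delivery_ind": late,
     "partial_fulfillment_ind": partial,
     "priority_delivery_ind": priority,
     "customer_loyalty_catg": loyalty}
    for i, (late, partial, priority, loyalty) in enumerate(
        product(late_delivery, partial_fulfillment,
                priority_delivery, customer_loyalty), start=1)
]

print(len(junk_dim))  # 2 * 2 * 2 * 4 = 32 rows
```

Because the full set of permutations exists up front, the ETL only ever has to look up the matching junk_key for an incoming order; no new dimension rows are created at load time.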
Read More

Physical Database Partitioning in Large Database Design

on 1/19/15 1:48 PM By | Greg Goleash | 0 Comments | Tips Data Management
Database partitioning is not a new technology, but one that is often overlooked in the design of large databases. Partitioning is simply a way of using separate physical storage locations for a single database object. Before partitioning was implemented by database vendors, this was done by creating separate tables for current and historical data, using views to combine the “partitioned” data, etc. There are a variety of reasons to partition tables and indexes, but this document will focus on the use of partitions in ODS and Data Warehouse databases.  In these environments, partitioning is primarily used for long-term performance stability and ease of maintenance.
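The "separate tables unified by a view" pattern mentioned above, which predates native partitioning support, can be sketched in a few statements. This is a minimal illustration using SQLite (which has no native partitioning); the table names are hypothetical:

```python
import sqlite3

# Manual "partitioning": one physical table per year, one view on top.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales_2014 (sale_date TEXT, amount REAL)")
cur.execute("CREATE TABLE sales_2015 (sale_date TEXT, amount REAL)")
cur.execute("""
    CREATE VIEW sales AS
    SELECT * FROM sales_2014
    UNION ALL
    SELECT * FROM sales_2015
""")
cur.execute("INSERT INTO sales_2014 VALUES ('2014-06-01', 100.0)")
cur.execute("INSERT INTO sales_2015 VALUES ('2015-03-15', 250.0)")

# Queries go through the unified view; maintenance (e.g., archiving
# an old year by dropping its table) touches one physical object.
total = cur.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(total)  # 2
```

Native partitioning in Oracle, SQL Server, or PostgreSQL does the same routing transparently, and adds the maintenance operations (drop, split, exchange) that make it attractive in ODS and warehouse environments.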
Read More

[Video] Quick Targeted Analysis using Talend's Open Studio for Data Quality

on 11/21/14 3:51 PM By | Mike Vogt | 0 Comments | Data Quality Data Management
Read More

Slowly Changing Dimensions: Three is Just Not Enough

on 11/12/14 3:58 PM By | Mark Panthofer | 0 Comments | Data Management
When I put on my data modeler hat, I'm always thinking of the appropriate dimension type to use when building out star schemas. For the most part, you have one of three common choices: Type 1, 2, or 3. I think we are all familiar with the definitions of these types, but let's quickly review.

As you know, Type 1 is a Slowly Changing Dimension that overwrites the old data with new data. A product dimension with changing prices will always have the most current price. To track history, you can use the Type 2 method: historical records are kept, typically including effective dates and sometimes a current-flag column. The Type 3 SCD has both the current attribute value and the previous attribute value in a single record. This method limits historical records and preserves history in columns as opposed to rows.

There is also a Type 0 method, which is debatable as a slowly changing dimension. The value is always the original value inserted, and changes are never performed. A date dimension is an example of Type 0.

Ralph Kimball defines Type 4 as a slowly changing dimension used when a group of attributes in a dimension changes rapidly and is split off into a mini-dimension, also called a rapidly changing monster dimension. This method is advantageous for highly volatile or frequently used attributes in a very large dimension. So if you know you have attributes that fit this category, split them off into their own physical mini-dimension table. The surrogate keys of both tables are captured as foreign keys in the fact table. An example is separating age band or income level attributes from the Customer base dimension and including them in the mini-dimension table.

So these are the conventional types mostly used for dimensional modeling. In the world of hybrid cars and hybrid golf clubs, there are also hybrid SCDs based primarily on the aforementioned methods, each with its advantages and disadvantages.

Type 5 is a hybrid of 4 and 1, basically adding a Type 1 dimension table on top of the Type 4 design of base and mini-dimensions. The Type 1 dim table is joined directly to the base dim without having to join through the fact table. It holds current-state attributes such as the current age band or current income level for a customer. The ETL team then only has to update this Type 1 table, as opposed to the whole base table, on any change.

So now let's get really crazy. Are you ready? Let's put a Type 3 attribute on top of a Type 2 dimension and update it as a Type 1. What do you get? You guessed it: Type 6 (3 + 2 + 1). One example Kimball provided was tracking historical departments and current departments in a Product dimension. The Product dimension itself is a Type 2 dimension, and historical data is preserved with historical records; the ETL strategy is standard Type 2 processing. The current department field is the Type 3 attribute and will only contain the current value, maintained as Type 1. The ETL strategy is different for this field: the ETL will need to update all historical rows for that product with the current department. Get it? Got it? Good. Stay with me now, as we've come to the last SCD type.

Type 7 is a different flavor of Type 6, providing similar functionality. The difference is that the Type 3 attribute (current department) is pulled out and placed in its own separate dimension. This current dimension table has a durable supernatural key, meaning the key ID will never change for a given product. The ETL strategy is to update this table on any change to capture current values. The fact table then contains dual foreign keys: the surrogate key linked to the Type 2 dimension (the Product table) and the durable supernatural key of the current dimension table.

So in a nutshell, these are the Slowly Changing Dimension types. Simple yet complex, but never restricted to only three.
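The core contrast between Type 1 (overwrite) and Type 2 (expire and version) ETL handling can be sketched in a few lines. This is a minimal illustration with dimension rows as plain dicts; the column names (price, current, start_date, etc.) are hypothetical:

```python
from datetime import date

def scd_type1_update(dim_rows, product_id, new_price):
    """Type 1: overwrite in place -- no history is kept."""
    for row in dim_rows:
        if row["product_id"] == product_id and row["current"]:
            row["price"] = new_price

def scd_type2_update(dim_rows, product_id, new_price, effective):
    """Type 2: expire the current row and append a new versioned row."""
    for row in dim_rows:
        if row["product_id"] == product_id and row["current"]:
            row["current"] = False
            row["end_date"] = effective
            # New version carries the changed attribute and new dates.
            dim_rows.append(dict(row, price=new_price, current=True,
                                 start_date=effective, end_date=None))
            return

dim = [{"product_id": 1, "price": 9.99, "current": True,
        "start_date": date(2014, 1, 1), "end_date": None}]

scd_type2_update(dim, 1, 12.99, date(2014, 11, 1))
print(len(dim))                                    # 2 rows: history preserved
print([r["price"] for r in dim if r["current"]])   # [12.99]
```

A Type 1 change on the same row would simply have overwritten the 9.99, leaving a single row and no record of the old price; fact rows loaded before the change would keep pointing at the expired Type 2 version's surrogate key.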
Read More

Data Quality in an Agile Development Sprint

on 11/6/14 6:18 PM By | Mike Vogt | 0 Comments | Data Quality Data Management
Read More