Data Virtualization Tools

Are they right for you?

Data virtualization tools claim to expose data from disparate sources quickly, reducing time-to-market from a consumption perspective. Products from Red Hat, Informatica, IBM, Cisco, and Denodo (to name a few) deliver on this claim by providing an abstraction layer on top of a variety of structured and unstructured data sources. Data virtualization tools utilize a middleware server to expose data from disparate sources as single logical entities. They feature multiple layers of abstraction, from unmodified views to views transformed to canonical standards, depending on the data consumption use case.

Technically, these tools issue run-time queries (or API calls, in the case of some SaaS endpoints) against the underlying data sources and apply on-the-fly transformations as required. Data from disparate sources is joined in memory to present a single logical view, and caching is employed to smooth out performance problems.
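To make the mechanics concrete, here is a minimal sketch of that run-time pattern. Two in-memory SQLite databases stand in for disparate source systems (the table names, queries, and `customer_revenue` view are illustrative assumptions, not any vendor's API); the "virtualization layer" queries each source at run time, joins the results in memory, and caches fetches to smooth repeated access.

```python
import sqlite3
from functools import lru_cache

# Two independent SQLite databases stand in for disparate source systems.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex")])

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoices (customer_id INTEGER, amount REAL)")
billing.executemany("INSERT INTO invoices VALUES (?, ?)",
                    [(1, 100.0), (1, 250.0), (2, 75.0)])

@lru_cache(maxsize=32)
def fetch(source_name):
    """Run-time query against one source; the cache smooths repeated access."""
    source, sql = {
        "customers": (crm, "SELECT id, name FROM customers"),
        "invoices": (billing, "SELECT customer_id, amount FROM invoices"),
    }[source_name]
    return tuple(source.execute(sql).fetchall())

def customer_revenue():
    """Single logical view: join the two sources in memory at run time."""
    names = dict(fetch("customers"))
    totals = {}
    for cust_id, amount in fetch("invoices"):
        totals[cust_id] = totals.get(cust_id, 0.0) + amount
    return {names[c]: total for c, total in totals.items()}

print(customer_revenue())  # {'Acme': 350.0, 'Globex': 75.0}
```

Nothing is persisted in the middle tier here: every call to the logical view re-queries the sources (subject to the cache), which is exactly what makes the approach fast to deliver and sensitive to source-system load.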

Care must be taken when aggregating large volumes of data from disparate sources. Consider the effect of combining (e.g. aggregating) 10M rows of data from two different source systems and then performing an in-memory join: bad things happen, such as the middleware server running out of memory or the source systems thrashing.
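One common mitigation is to push the aggregation down to the source systems so only small, pre-grouped result sets travel to the middle tier. The sketch below (table and column names are hypothetical) contrasts a pushed-down aggregate with the naive approach of pulling every raw row into middleware memory before joining:

```python
import sqlite3

# Stand-in source system with many rows of raw order data.
orders = sqlite3.connect(":memory:")
orders.execute("CREATE TABLE orders (region TEXT, amount REAL)")
orders.executemany("INSERT INTO orders VALUES (?, ?)",
                   [("east", 1.0)] * 5000 + [("west", 2.0)] * 5000)

# Pushed-down query: the source does the aggregation, and the middleware
# holds only one row per group instead of the full raw data set.
pushed_down = orders.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(pushed_down)  # [('east', 5000.0), ('west', 10000.0)]

# Naive alternative: pull all raw rows into middleware memory before joining.
raw_rows = orders.execute("SELECT region, amount FROM orders").fetchall()
print(len(raw_rows))  # 10000 rows held in memory, versus 2 aggregated rows
```

At 10,000 rows the difference is academic; at 10M rows per source it is the difference between a query that returns and a middleware server that falls over.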

While data virtualization can reduce your time-to-market in delivering data to business consumers, apply data transformations, and provide centralized security across a plethora of disparate sources, care must be taken to monitor run-time performance (and the availability of source systems). When problems are identified, consider engineering a solution in which data is integrated prior to run-time. While sub-optimal, this represents one of the engineering trade-offs needed to balance functionality (including time-to-market) against performance and availability. Note also that directly exposing data from operational systems subjects them to the run-time load of the data virtualization tools, risking poor performance or unavailability.

Some common techniques utilized for data integration (prior to run-time) include operational data stores (ODS), data marts (DM), and operational reporting stores. While any number of technologies (SQL, NoSQL) can support each of these, the key takeaway is that they are properly modeled/engineered (whether schema-on-read or schema-on-write) to support the business objectives at hand.
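The pre-integration idea can be sketched in a few lines. Here an SQLite table plays the role of an ODS-style target (the `customer_revenue` table and `refresh_ods` batch job are illustrative assumptions): source extracts are integrated on write during a batch window, and consumers query the modeled copy at run time without touching the operational systems.

```python
import sqlite3

# ODS-style target: the cross-source join is materialized ahead of run time.
ods = sqlite3.connect(":memory:")
ods.execute("""CREATE TABLE customer_revenue (
                   customer_name TEXT PRIMARY KEY,
                   total_revenue REAL)""")

def refresh_ods(customers, invoices):
    """Batch load (e.g. nightly): integrate the source extracts on write."""
    totals = {}
    for cust_id, amount in invoices:
        totals[cust_id] = totals.get(cust_id, 0.0) + amount
    names = dict(customers)
    rows = [(names[c], t) for c, t in totals.items()]
    ods.execute("DELETE FROM customer_revenue")  # full refresh for simplicity
    ods.executemany("INSERT INTO customer_revenue VALUES (?, ?)", rows)
    ods.commit()

# Extracts pulled from the source systems during the batch window.
refresh_ods(customers=[(1, "Acme"), (2, "Globex")],
            invoices=[(1, 100.0), (1, 250.0), (2, 75.0)])

# Consumers hit the pre-integrated store; no operational system is queried.
print(ods.execute(
    "SELECT * FROM customer_revenue ORDER BY customer_name").fetchall())
```

The trade-off from the preceding section shows up directly: the data is only as fresh as the last refresh, but run-time queries are cheap, predictable, and isolated from the operational systems.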


For additional blog posts, please check out NVISIA's Enterprise Data Management page.

Topics: Data Management

Written by Mike Vogt

Mike Vogt is a Director on NVISIA's data management team.
