Avoiding the Data Iceberg with Quick Targeted Analysis



Have you run into a situation where an agile project flopped because bad data derailed the delivery of your solution?

Typically, a representative data set is chosen to cover the "happy path" of a business environment. With today's highly integrated systems, these data sets are far more complex and require far more vigilance in testing.

Consider a situation in which the "happy path" data set is insufficient to represent real-world conditions, and problems are found in QA (in a sprint after the functionality is delivered) or, worse yet, in production.

Example

A customer would be upset if only "most" of her orders showed up when she was checking on the status of an order. She asks why this is happening, and some quick analysis by a CSR reveals inconsistent data in the sales and fulfillment systems. At this point there is no quick resolution to the customer's problem -- instead, a support ticket is created and escalated (adding to the customer's frustration).

(See the accompanying "how to" video for QTA using Talend's Open Studio for Data Quality.)

Upon further analysis of the inconsistent data (the "not so happy path"), the following is determined:
- Relating orders to customers by customer number and order date is unreliable.
- Customer lookup by name and phone number needs improvement.
- Customer phone numbers contain 30% invalid data (missing area code, not 10 digits long) -- see the validity-check sketch below.
- Customer address information contains 10% invalid data (including 8% invalid postal codes).
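
The exact rules behind those percentages aren't shown here, but as a rough illustration, a few lines of Python can express the kinds of validity checks involved. The 10-digit/area-code phone rule and the ZIP code format below are assumptions inferred from the findings, not the profiling tool's built-in definitions:

```python
import re

# Illustrative validity rules only -- the exact definitions used in the
# analysis above are assumptions inferred from the findings (10 digits
# including an area code for US phones, 5-digit or ZIP+4 postal codes).
PHONE_RE = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")

def percent_invalid(values, pattern):
    """Return the share (0-100) of non-blank values that fail the pattern."""
    non_blank = [v.strip() for v in values if v and v.strip()]
    if not non_blank:
        return 0.0
    invalid = sum(1 for v in non_blank if not pattern.match(v))
    return 100.0 * invalid / len(non_blank)

# Toy sample: the second number has no area code and fewer than 10 digits.
phones = ["555-867-5309", "867-5309", "(312) 555-0142", ""]
print(f"% invalid phone numbers: {percent_invalid(phones, PHONE_RE):.0f}%")
```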

Alternative Outcome

A customer checks her orders, which are complete and up-to-date, and there is no change to her day -- her basic expectations of the merchant have been met.

Alternative Approach

Utilizing agile data profiling (Quick Targeted Analysis), one could quickly determine that the quality of the customer data was poor and remediate it during the development sprint. The basic idea is to use a data profiling tool (e.g., Talend's Open Studio for Data Quality) and some out-of-the-box metrics, such as % null, % invalid postal codes, % invalid US phone numbers, and pattern frequency (think regex pattern) analyses, to quickly identify potential trouble spots within a data set, as sketched below.
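
Talend's metrics are configured in the tool rather than written by hand, but as a minimal sketch of what pattern-frequency profiling does, the Python below maps each value to a character-class pattern and reports how often each pattern occurs, alongside a % null metric. The function names and sample values are illustrative only:

```python
from collections import Counter

def value_pattern(value: str) -> str:
    """Map each character to a class: '9' for digits, 'A' for letters."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in value)

def profile_column(values):
    """Rough stand-in for a %-null metric plus pattern-frequency analysis."""
    total = len(values)
    nulls = sum(1 for v in values if v is None or not str(v).strip())
    patterns = Counter(value_pattern(str(v).strip()) for v in values
                       if v is not None and str(v).strip())
    print(f"% null/blank: {100.0 * nulls / total:.1f}%")
    for pattern, count in patterns.most_common():
        print(f"  {pattern}: {100.0 * count / total:.1f}%")

# Toy sample: low-frequency patterns are usually the trouble spots.
profile_column(["414-555-0187", "555-0198", None, "(262) 555-0123", "4145550144"])
```

Values whose pattern falls outside the expected shape (e.g., 999-999-9999) surface immediately, without writing a custom SQL query for each rule.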

With a very small investment of time (e.g., a few hours in a two-week sprint) and the proper tools, as opposed to a battery of custom SQL queries, potential issues can be identified early in the sprint, and decisions about remediation or further analysis can be made quickly without significantly impacting the delivery of other items in the sprint.

While I'm not advocating any specific tool, I'm suggesting there's a more efficient way to quickly identify data quality issues, especially those that arise in today's heterogeneous system landscape – part of a Connected Blueprint where the risk associated with complex system integration is mitigated by leveraging our deep data integration expertise.

 

Here is the original video broken down by method:
