Tuesday, May 29, 2012

Very interesting - ETL Framework

Wednesday, May 16, 2012

My self-study on BI - Business Intelligence (TBC)


Business Intelligence Research

I.                   ETL Testing

A.                 As Informatica Product Specification

1.       Source and Destination Loading one-one (no transformation)
2.       Source and Destination Loading with transformation
·         Transformations are built from the pre-built operators in the Informatica product, so no programming skill is needed
3.       Production Validation: validate that the source data was loaded into the Production environment correctly

B.                 As GeekInterview.com

Testing covers:
1.       Data validation within Staging to check all
·         Mapping Rules
·         Transformation Rules
2.       Data validation within Destination to check if
·         Data is present in required format
·         There is no data loss from Source to Destination

C.                 Data-centric Testing

Applies specifically to ETL processes, where data movement happens
1.       Technical Testing: ensures that the data is copied, moved or loaded from the source system to the target system correctly and completely. It is performed by comparing the target data against the source data. List of testing techniques:
a.       Checksum Comparison: checks that quantitative information in the source and destination databases is the same, using a checksum technique. For example: the number of records in the source database compared to the destination; or ACCUMULATED information in the source database compared to CALCULATED information in the destination database. Eg: the source database holds monthly salaries, and the destination database has a new column containing the sum of the monthly salaries paid within each year. The total of the monthly salaries of all employees in the source should equal the total of the yearly salaries of all employees in the destination
b.      Domain Comparison: compares the list of unique entries (distinct values of a field) in the source database to the unique entries in the destination database, like a dictionary list. For example: the list of Employee Names in the Salary table of the source database compared to the list of Employee Names in the destination database
c.       Multi-value Comparison: similar to Domain/Dictionary/List comparison, but it compares the WHOLE record or the CRITICAL columns between the source and destination databases, including the MATCHING between these columns. For example: Domain comparison reports the correctness of the Employee list between the source and destination databases, and Checksum comparison reports the correctness of the Salary entries; but these comparisons do not guarantee that each Salary entry is assigned to the right Employee entry (the MATCHING between columns). Multi-value comparison discovers such issues by comparing the key columns/attributes of each record between the source and destination databases
2.       Business Testing: validates business common sense, eg: Salary/Commission cannot be less than zero. The list of rules to test against can be exhaustive, and it depends on domain and industry knowledge. A list of best practices still needs to be researched
3.       Reconciliation: ensures that the data in the destination database is in agreement with the overall system requirements. Examples of how reconciliation helps in achieving high-quality data:
a.       Internal reconciliation: data within the destination database is compared against other data in the same database (mostly in terms of business constraints), eg: the number of shipments is always less than or equal to the number of orders, otherwise the data is invalid
b.      External reconciliation: data within the destination database is compared to another (external) system, eg: the Number of Employees in the destination database cannot be larger than the Number of Employees in the HR Employee Master System
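
The three technical testing techniques above (checksum, domain, multi-value) can be sketched with an in-memory SQLite database. This is only a minimal illustration: the table names `src_salary` and `dst_salary`, their columns and the sample rows are assumptions made up for this note, not part of any tool. The destination data contains a deliberate defect (two totals swapped) that only the multi-value comparison can see.

```python
import sqlite3

# Hypothetical schema: the source holds monthly rows, the destination yearly totals.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_salary (employee TEXT, year INTEGER, month INTEGER, amount INTEGER);
    CREATE TABLE dst_salary (employee TEXT, year INTEGER, annual_amount INTEGER);
    INSERT INTO src_salary VALUES
        ('Ann', 2011, 1, 100), ('Ann', 2011, 2, 100),
        ('Bob', 2011, 1, 200), ('Bob', 2011, 2, 200);
    -- Deliberate defect: the yearly totals of Ann and Bob are swapped.
    INSERT INTO dst_salary VALUES ('Ann', 2011, 400), ('Bob', 2011, 200);
""")


def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]


def checksum_matches():
    """Checksum comparison: grand total of the source vs. grand total of the destination."""
    return (scalar("SELECT SUM(amount) FROM src_salary")
            == scalar("SELECT SUM(annual_amount) FROM dst_salary"))


def domain_matches():
    """Domain comparison: distinct employee names on both sides."""
    src = {row[0] for row in conn.execute("SELECT DISTINCT employee FROM src_salary")}
    dst = {row[0] for row in conn.execute("SELECT DISTINCT employee FROM dst_salary")}
    return src == dst


def multi_value_mismatches():
    """Multi-value comparison: (employee, year, total) tuples present on only one side."""
    expected = set(conn.execute(
        "SELECT employee, year, SUM(amount) FROM src_salary GROUP BY employee, year"))
    actual = set(conn.execute("SELECT employee, year, annual_amount FROM dst_salary"))
    return sorted(expected ^ actual)


print(checksum_matches())        # True: both totals are 600
print(domain_matches())          # True: both sides list Ann and Bob
print(multi_value_mismatches())  # non-empty: salaries are assigned to the wrong employee
```

In this sample the checksum and domain checks both pass, which is exactly the situation described in item c: only the comparison of whole key/value tuples shows that the salaries ended up on the wrong employees.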

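Business testing and reconciliation from the list above can be sketched the same way. Again this is only an assumed example: the tables `dst_employee`, `dst_orders`, `dst_shipments` and the `hr_master_count` parameter are hypothetical names, and in practice the external count would be read from the other system.

```python
import sqlite3

# Hypothetical destination tables with a few sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dst_employee (name TEXT, salary INTEGER, commission INTEGER);
    CREATE TABLE dst_orders (order_id INTEGER);
    CREATE TABLE dst_shipments (shipment_id INTEGER, order_id INTEGER);
    INSERT INTO dst_employee VALUES ('Ann', 1000, 50), ('Bob', -200, 0);
    INSERT INTO dst_orders VALUES (1), (2);
    INSERT INTO dst_shipments VALUES (10, 1);
""")


def count(table):
    """Row count of one of the hypothetical tables above (fixed names only)."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def negative_pay_employees():
    """Business test: Salary/Commission cannot be less than zero."""
    rows = conn.execute(
        "SELECT name FROM dst_employee WHERE salary < 0 OR commission < 0 ORDER BY name")
    return [row[0] for row in rows]


def internal_reconciliation_ok():
    """Internal reconciliation: shipments must not outnumber orders."""
    return count("dst_shipments") <= count("dst_orders")


def external_reconciliation_ok(hr_master_count):
    """External reconciliation: compare against a count supplied by the HR master system."""
    return count("dst_employee") <= hr_master_count


print(negative_pay_employees())        # ['Bob']
print(internal_reconciliation_ok())    # True
print(external_reconciliation_ok(1))   # False: destination has more employees than HR
```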

II.               ETL Implementation Strategy

A.                 Suggested Strategy by ETLGuru.com

1.       Theory
·         Every time data moves, it needs to be validated
·         There are various test conditions during migration from DEV to QA and from QA to PRODUCTION
2.       Practice
·         A better ETL strategy is to store all the BUSINESS RULES in centralized tables, covering the source as well as the target system; these rules can be stored as SQL text
·         This acts as a repository that can be called by any ETL process or auditor at any phase of the project life cycle. There is NO need to re-think or re-write the rules
·         Any or all of these rules can be made OPTIONAL, a TOLERANCE can be defined, and the rules can be CALLED immediately after a process runs, or the data can be audited at leisure
·         This data validation/auditing system basically contains:
a. The tables that contain the rules
b. The process that calls the rules dynamically
c. The tables that store the results from the execution of the rules
·         Benefits
a. Rules can be added dynamically with no change to ETL code
b. Rules are stored dynamically
c. Tolerance level can be changed without ever changing the ETL code
d. Business Rules can be added or validated by a Business Expert without worrying about ETL code
·         This practice can be applied to ETL tools and databases: Informatica, DataStage, SyncSort DMExpress, Sunopsis, Oracle, Sybase, SQL Server Integration Services (SSIS)/DTS

III.            ETL Architecture Design

A.                 Study shows:

There are 3 proposed layers:
1.       Layer 1: Data relational: extract, transform, load from source to destination
2.       Layer 2: Control, Log, Security and Authorization of ETL processes, organize and call sub-processes
3.       Layer 3: Manage and Schedule ETL processes, Recovery from Failure, Load Balancing, etc.
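
One possible way to picture the three layers is as nested wrappers around the data movement. The sketch below is an assumption for illustration only: the function names `run_etl`, `controlled` and `run_with_retry` are invented here, and only logging (Layer 2) and recovery by retry (Layer 3) are shown; security, authorization, scheduling and load balancing are left out.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")


# Layer 1: data movement (extract, transform, load).
def run_etl(source_rows, destination):
    transformed = [{"name": row["name"].strip().title()} for row in source_rows]
    destination.extend(transformed)  # load only after the whole transform succeeded
    return len(transformed)


# Layer 2: control and logging around a Layer 1 process.
def controlled(process, *args):
    log.info("starting %s", process.__name__)
    loaded = process(*args)
    log.info("finished %s, %d rows loaded", process.__name__, loaded)
    return loaded


# Layer 3: management of the process, here only recovery from failure via retries.
def run_with_retry(process, *args, attempts=3):
    for attempt in range(1, attempts + 1):
        try:
            return controlled(process, *args)
        except Exception:
            log.exception("attempt %d of %d failed", attempt, attempts)
    raise RuntimeError(f"{process.__name__} failed after {attempts} attempts")


destination = []
rows_loaded = run_with_retry(run_etl, [{"name": " ann "}], destination)
print(rows_loaded, destination)  # 1 [{'name': 'Ann'}]
```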

B.                 Incremental Loading Design

1.       Change Data Capture: there are 3 main approaches
·         Log-based CDC
·         Audit columns
·         Calculation of snapshot differentials
2.       F
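
Two of the Change Data Capture approaches in item 1 can be sketched in a few lines. Log-based CDC depends on the transaction log of the specific database, so it is not shown. The column name `updated_at`, the sample rows and the dictionary-based snapshots are assumptions chosen for this note.

```python
def changed_since(rows, last_load_time):
    """Audit-column approach: keep rows whose updated_at is later than the last load.

    Timestamps are ISO-formatted strings here, so string comparison follows date order.
    """
    return [row for row in rows if row["updated_at"] > last_load_time]


def snapshot_diff(previous, current):
    """Snapshot differential: compare two full extracts keyed by primary key."""
    inserted = {key: value for key, value in current.items() if key not in previous}
    deleted = {key: value for key, value in previous.items() if key not in current}
    updated = {key: value for key, value in current.items()
               if key in previous and previous[key] != value}
    return inserted, updated, deleted


# Sample data for the audit-column approach.
rows = [
    {"id": 1, "name": "Ann", "updated_at": "2012-05-01"},
    {"id": 2, "name": "Bob", "updated_at": "2012-05-15"},
]
print(changed_since(rows, "2012-05-10"))  # only the row for Bob

# Sample data for the snapshot differential: primary key -> (name, salary).
previous = {1: ("Ann", 1000), 2: ("Bob", 2000), 3: ("Cid", 3000)}
current = {1: ("Ann", 1000), 2: ("Bob", 2500), 4: ("Dee", 4000)}
print(snapshot_diff(previous, current))
# ({4: ('Dee', 4000)}, {2: ('Bob', 2500)}, {3: ('Cid', 3000)})
```

The audit-column approach is cheap but cannot see deleted rows, while the snapshot differential sees inserts, updates and deletes at the cost of comparing two full extracts.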

IV.             Advanced (mostly for large-scale databases and data volumes)

A.                 MapReduce



B.                 Hadoop



C.                 A Highly Scalable Dimensional ETL Framework based on MapReduce





