Business Intelligence Research
I.
ETL Testing
A.
As per the Informatica Product Specification
1.
One-to-one loading from source to destination (no
transformation)
2.
Loading from source to destination with
transformation
·
Transformations are built from the pre-built operators in the Informatica
product, so no programming skill is needed
3.
Production Validation: validate that the source
data was loaded into the Production environment correctly
Testing covers:
1.
Data validation within Staging to check all
·
Mapping Rules
·
Transformation Rules
2.
Data validation within Destination to check if
·
Data is present in the required format
·
There is no data loss from Source to Destination
C.
Data-centric Testing
Applies specifically to
ETL processes, where data movement happens
1.
Technical Testing: ensures that the data is copied, moved or loaded
from the source system to the target system correctly and completely. Technical
testing is performed by comparing the target data against the source data.
Testing techniques:
a.
Checksum Comparison: checks that quantitative (aggregate) information is the
same in the source and destination databases. For example: the number
of records in the source database compared to the number in the destination; or
ACCUMULATED information in the source database compared to CALCULATED
information in the destination database. E.g. monthly salaries in the source
database are summarized into a new column in the destination database that
holds the sum of salaries paid within each year. The total of the monthly
salaries of all employees in the source should then equal the total of the
yearly salaries of all employees in the destination
b.
Domain Comparison: compares the list of unique values of a field in the
source database to the list of unique values of the same field in the
destination database, like a dictionary list. For example: the list of
Employee Names in the Salary table of the source database compared to the list
of Employee Names in the destination database
c.
Multi-value Comparison: similar to the Domain/Dictionary/List comparison, but
it compares the WHOLE record or its CRITICAL columns between the source and
destination databases, and so also checks the MATCHING between these columns.
For example: Domain comparison reports whether the Employee list is correct
between the source and destination databases, and Checksum comparison reports
whether the Salary entries are correct, but neither guarantees that each Salary
entry is assigned to the right Employee entry (the MATCHING between
columns). Multi-value comparison discovers such issues by comparing the key
columns/attributes of each record between the source and destination databases
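
The three technical testing techniques above can be sketched in a few lines of Python. This is only an illustration on in-memory rows: the sample data, the (employee, month, salary) layout and the function names are assumptions, not part of any ETL tool. The target rows are built so that the checksum and domain comparisons both pass while two salaries are attached to the wrong employee, which only the multi-value comparison detects.

```python
from collections import Counter

# Hypothetical sample data: (employee, month, salary) rows on each side.
source_rows = [
    ("alice", "2023-01", 3000),
    ("alice", "2023-02", 3000),
    ("bob", "2023-01", 4000),
    ("bob", "2023-02", 4000),
]
# Same names and same totals, but two salaries went to the wrong employee.
target_rows = [
    ("alice", "2023-01", 3000),
    ("alice", "2023-02", 4000),
    ("bob", "2023-01", 4000),
    ("bob", "2023-02", 3000),
]

def checksum_matches(source, target):
    """Checksum comparison: row counts and summed salaries must agree."""
    same_count = len(source) == len(target)
    same_total = sum(r[2] for r in source) == sum(r[2] for r in target)
    return same_count and same_total

def domain_matches(source, target):
    """Domain comparison: the sets of distinct employee names must agree."""
    return {r[0] for r in source} == {r[0] for r in target}

def multi_value_mismatches(source, target):
    """Multi-value comparison: compare whole records as multisets."""
    src, tgt = Counter(source), Counter(target)
    missing = sorted((src - tgt).elements())     # in source, not in target
    unexpected = sorted((tgt - src).elements())  # in target, not in source
    return missing, unexpected

checksum_ok = checksum_matches(source_rows, target_rows)
domain_ok = domain_matches(source_rows, target_rows)
missing, unexpected = multi_value_mismatches(source_rows, target_rows)
```

Here `checksum_ok` and `domain_ok` are both true, yet `missing` and `unexpected` each contain two records, which is the kind of mismatch the multi-value comparison is meant to surface.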
2.
Business Testing: validates business common sense, e.g. Salary/Commission
cannot be less than zero. The list of rules to test against can be
exhaustive, and it depends on domain and industry knowledge.
To do: research a list of best practices
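
A business test of this kind can be sketched as a list of named predicates applied to every record. The rule names, the record layout and the sample rows below are hypothetical and only illustrate the Salary/Commission example.

```python
# Hypothetical business rules: a name plus a predicate over one record.
RULES = [
    ("salary_not_negative", lambda row: row["salary"] >= 0),
    ("commission_not_negative", lambda row: row["commission"] >= 0),
]

def find_violations(rows, rules=RULES):
    """Return (row_index, rule_name) for every rule that a row fails."""
    return [
        (index, name)
        for index, row in enumerate(rows)
        for name, check in rules
        if not check(row)
    ]

# Hypothetical destination records to validate.
rows = [
    {"employee": "alice", "salary": 3000, "commission": 150},
    {"employee": "bob", "salary": -50, "commission": 0},
    {"employee": "carol", "salary": 4000, "commission": -10},
]
violations = find_violations(rows)
```

New rules are added by appending to `RULES`, so the checking code itself does not change as the rule list grows.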
3.
Reconciliation:
Ensures that the data in the destination database agrees with the
overall system requirements. Examples of how reconciliation helps
achieve high-quality data:
a.
Internal reconciliation: data within the destination
database is compared against other data in the same database (mostly in terms
of business constraints), e.g. the number of shipments must always be less than
or equal to the number of orders; otherwise the data is invalid
b.
External reconciliation: data within the
destination database is compared to another (external) system, e.g. the number
of Employees in the destination database cannot be larger than the number of
Employees in the HR Employee Master System
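
Both reconciliation examples can be sketched as simple checks. The function names and sample values are hypothetical. The list of unknown employee ids goes one step beyond the count comparison described above and is an added assumption about what a stricter external check could report.

```python
def internal_reconciliation(order_count, shipment_count):
    """Internal check: shipments must never exceed orders."""
    return shipment_count <= order_count

def external_reconciliation(destination_ids, hr_master_ids):
    """External check against the (hypothetical) HR master system.

    Returns whether the destination has no more employees than the master,
    plus any destination ids the master does not know (an added, stricter
    check that is not part of the count rule itself).
    """
    destination, master = set(destination_ids), set(hr_master_ids)
    count_ok = len(destination) <= len(master)
    unknown = sorted(destination - master)
    return count_ok, unknown

internal_ok = internal_reconciliation(order_count=120, shipment_count=118)
external_ok, unknown_ids = external_reconciliation(
    ["e1", "e2", "e9"], ["e1", "e2", "e3", "e4"]
)
```

In this sample the count rule passes, but `unknown_ids` still reports `e9`, which suggests that a count comparison alone is a fairly weak reconciliation.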
II.
ETL Implementation Strategy
A.
Suggested Strategy by ETLGuru.com
1.
Theory
·
Every time data is moved, it needs to be
validated
·
There are various test conditions during
migration from DEV to QA and from QA to PRODUCTION
2.
Practice
·
A better ETL strategy is to store all the
BUSINESS RULES in centralized tables, for both the source and the target
system; these rules can be stored as SQL text.
·
This acts as a repository that can be called
from any ETL process, or by auditors, at any phase of the project life cycle.
There is NO need to re-think or re-write the rules
·
Any or all of these rules can be made OPTIONAL, a
TOLERANCE can be defined, and the rules can be CALLED immediately after a
process runs, or the data can be audited at leisure
·
This data validation/auditing system basically
contains:
a. The tables that contain the rules
b. The process that calls the rules dynamically
c. The tables that store the results from the execution of the rules
·
Benefits
a. Rules can be added dynamically with no change to the ETL code
b. Rules are stored dynamically
c. Tolerance levels can be changed without ever changing the ETL code
d. Business Rules can be added or validated by a Business Expert without
worrying about the ETL code
·
This practice can be applied to ETL tools and
databases such as Informatica, DataStage, SyncSort DMExpress, Sunopsis, Oracle,
Sybase and SQL Server Integration Services (SSIS)/DTS
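
A minimal sketch of such a rules repository, using Python's built-in sqlite3 module. The table names, column names and sample rules are hypothetical. The sketch assumes that each rule's SQL text returns a single number (the count of violating rows) and that the rule text is written by trusted maintainers, since it is executed as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE salary (employee TEXT, amount REAL);
    INSERT INTO salary VALUES ('alice', 3000), ('bob', -50), ('carol', 4000);

    -- (a) the table that contains the rules, stored as SQL text
    CREATE TABLE validation_rule (
        rule_name TEXT PRIMARY KEY,
        sql_text  TEXT NOT NULL,     -- returns one number: violation count
        tolerance INTEGER NOT NULL,  -- violations allowed before failing
        enabled   INTEGER NOT NULL   -- 0 switches an optional rule off
    );
    INSERT INTO validation_rule VALUES
        ('salary_not_negative',
         'SELECT COUNT(*) FROM salary WHERE amount < 0', 0, 1),
        ('employee_not_null',
         'SELECT COUNT(*) FROM salary WHERE employee IS NULL', 0, 1),
        ('salary_below_cap',
         'SELECT COUNT(*) FROM salary WHERE amount > 3500', 0, 0);

    -- (c) the table that stores the results of each rule execution
    CREATE TABLE validation_result (
        rule_name  TEXT,
        violations INTEGER,
        passed     INTEGER
    );
""")

def run_rules(connection):
    """(b) the process that calls every enabled rule dynamically."""
    rules = connection.execute(
        "SELECT rule_name, sql_text, tolerance FROM validation_rule "
        "WHERE enabled = 1 ORDER BY rule_name"
    ).fetchall()
    for rule_name, sql_text, tolerance in rules:
        violations = connection.execute(sql_text).fetchone()[0]
        connection.execute(
            "INSERT INTO validation_result VALUES (?, ?, ?)",
            (rule_name, violations, int(violations <= tolerance)),
        )
    connection.commit()

run_rules(conn)
results = conn.execute(
    "SELECT rule_name, violations, passed FROM validation_result "
    "ORDER BY rule_name"
).fetchall()
```

Adding a rule, disabling it, or changing its tolerance is then an INSERT or UPDATE on `validation_rule`, and `run_rules` stays untouched.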
III.
ETL Architecture Design
A.
Study shows:
There are 3 proposed layers:
1.
Layer 1: Relational data: extract, transform and
load from source to destination
2.
Layer 2: Control, logging, security and
authorization of ETL processes; organizing and calling sub-processes
3.
Layer 3: Managing and scheduling ETL processes,
recovery from failure, load balancing, etc.
B.
Incremental Loading Design
1.
Change Data Capture (CDC): there are 3 main approaches
·
Log-based CDC
·
Audit columns
·
Calculation of snapshot differentials
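
Of the three approaches, the snapshot differential is the easiest to sketch: two full snapshots of a table, keyed by primary key, are compared to find the rows to load incrementally. The function name and sample snapshots below are hypothetical, and the sketch assumes both snapshots fit in memory.

```python
def snapshot_diff(previous, current):
    """Compare two snapshots, each a dict of primary key -> row.

    Returns the rows inserted, updated and deleted since the previous
    snapshot.
    """
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    updated = {
        k: current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    }
    return inserted, updated, deleted

# Hypothetical snapshots of an employee table: id -> (name, salary).
previous = {1: ("alice", 3000), 2: ("bob", 4000), 3: ("carol", 3500)}
current = {1: ("alice", 3000), 2: ("bob", 4200), 4: ("dave", 2800)}

inserted, updated, deleted = snapshot_diff(previous, current)
```

Only the rows in `inserted`, `updated` and `deleted` need to be applied to the destination, rather than reloading the whole table.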
2.
F
IV.
Advanced (mostly for large-scale
databases and data volumes)
A.
MapReduce
B.
Hadoop
C.
A Highly Scalable Dimensional ETL Framework
based on MapReduce