Introduction:
Capturing and preserving the state of data across time is one of the core functions of a data warehouse, but CDC can be utilized in any database or data integration tool. There are many methodologies such as Timestamp, Versioning, Status indicators, Triggers and Transaction logs exists but MD5 function outlines on working with Checksum.
Overview:
MD5 stands for Message Digest 5 algorithm.It calculates the checksum of the input value using a cryptographic Message-Digest algorithm 5 and returns a128-bit 32 character string of hexadecimal digits (0 - F).
Advantage of using MD5 function is that, it will reduce overall ETL run time and also reduces cache memory usage by caching only required fields which are utmost necessary.
Implementation Steps :
- Identify the ports from the source which are subjected to change.
- Concatenate all the ports and pass them as parameter to MD5 function in expression transformation
- Map the MD5 function output to a checksum output port in the expression transformation
- Lookup on target primary keys, checksum port
- Detect changes by checking for nullability of primary keys and compare source checksum with the target checksum field
- Update checksum field with other fields if primary key exist in the target else insert records to the target
Advantages:
- Minimize Lookup Cache creation time
- Checksum comparisons are faster
- Overall improvement in ETL run time
nice work sir informatica Online Course
ReplyDelete