CDC with MD5 function in Informatica

What is the MD5 Function? MD5 (Message Digest Function) is a hash function which is used to evaluate data integrity.  It is based upon Message-Digest Algorithm 5 (MD5) which calculates the checksum of the input value, it is a one-way cryptographic hash function with a 128-bit hash value. MD5 returns a 32 character string of hexadecimal digits 0-9 & a-f and returns NULL if the input is a null.

When To Use MD5 Function? Only update the changed records (any column change) in the target.  Instead of passing all existing records to the target for update, ( whether changed or unchanged ) it’s always recommended to compare the records.

Column-to-column comparison is little painful if your column counts are high.  Concatenate all columns and use the MD5 Function (source and target records for the same key) and then compare the output from the MD5 Function.  The changed records can be identified and only those records can be updated into the target.  This calculation is done really fast and the output of MD5 Function can be used as a unique key to differentiate records. MD5 function enhances the performance as compared to lookups only when the comparison columns are more than 10.

Limitations: If we have non-string columns (SMALLINT, INT, NUMBER, etc.) we need to convert them into characters using TO_CHAR function because the MD5 Function only validates character strings. Sometimes it does validate the output port using MD5, but the warning message generated may read validated and non-string data type (e.g. SMALLINT) columns are converted to string.  This warning has a great performance impact.

Example: Consider a scenario where the incoming Product records consist of PRODUCT_NUM, PRODUCT_DESC, and address fields which have no primary key, are associated with them.  In such a scenario, it is imperative that a unique identifier be assigned to these records on-the-fly which is immutable.  This on-the-fly unique identifier can also be used in future loads as a key to identify whether an incoming customer record is a potential update or an insert.

MD5 in Infa

Verdict: By using the MD5 values, we can identify whether the data is changed or unchanged without the performance being degraded. The MD5 value is always recommended for scenarios with many comparison columns and no primary key columns. There is limitation, however; the input to the MD5 values needs to be a string by data type and it returns a 32 bit hexadecimal.

Advertisements