Add new columns to Hashdiff

Did your source add new columns and you want to add them to the same satellite you held the other columns of this source? Don’t worry! By using append_new_column as value for the on_schema_change config variable in dbt it is possible to append new columns to a Satellite. For that you would also have to add these new columns to the src_payload key in the Satellite v0 model and to the list in the columns key inside the definition of your satellite’s hashdiff in the stage model.

When a new column(s) is(are) added in a Satellite and also added to the Hashdiff list of columns, the calculated Hashdiff in the Stage normally would be different to those already present in the Satellite even if the new column(s) is(are) NULL. This could result in many unnecessary new rows being added into the satellite. We have developed a solution for the Snowflake datavault4dbt adapter, where you can set in the Stage model inside the Satellite Hashdiff definition the key value pair: use_rtrim:true.

By adding this optional key to your stage you can now use the dbt config variables to avoid inserting a new row when all the new columns you have added to the satellite still hold NULL values.

Note that if your Satellite already has the last columns (the ones first inserted in the satellite) holding NULL values, and you add new columns that also hold NULL values, and you were not previously using the key value use_rtrim:true in the hashdiff but started using now, it will still calculate a different Hashdiff then the ones loaded previously in the Satellite. This is because we use delimiters that replace NULL value in the hash calculation, and they will only get removed from the value to be hashed if you use use_rtrim:true, and since your Satellite was previously loaded with the hashdiff calculation that does not remove these delimiters before hashing the hashdiffs results will still be different. We recommend from now on if you use Snowflake adapter always using use_rtrim:true in the hashdiffs of newly added Satellites to avoid this mentioned issue if your satellite may in the future need to be extended to accommodate new columns.

Example 1 – Stage definition

{{ config(materialized='view') }}

{%- set yaml_metadata -%}
source_model: 'source_account'
ldts: 'edwLoadDate'
rsrc: 'edwRecordSource'
hashed_columns: 
    hk_account_h:
        - account_number
        - account_key
    hd_account_s:
        is_hashdiff: true
        use_rtrim: true
        columns:
            - name
            - address
            - phone
            - email
            - new_col1
            - new_col2
{%- endset -%}

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{{ datavault4dbt.stage(source_model=metadata_dict.get(source_model),
                    ldts=metadata_dict.get(ldts),
                    rsrc=metadata_dict.get(rsrc),
                    hashed_columns=metadata_dict.get(hashed_columns)
                    ) }}

Example 1 – Satellite v0 definition

Note that you have to add the on_schema_change key with the value append_new_columns to your config clause in the Satellite v0 model. In the src_payload just add the new columns **after ** the already present columns (this is important because we trim the hashdiff on the right). In this example below, the columns nameaddressphoneemail were already in the satellite and we added the columns new_col1 and new_col2

{{ config(materialized='incremental', on_schema_change="append_new_columns") }}

{%- set yaml_metadata -%}
parent_hashkey: 'hk_account_h'
src_hashdiff: 'hd_account_s'
src_payload:
    - name
    - address
    - phone
    - email
    - new_col1
    - new_col2    
source_model: 'stage_account'
{%- endset -%}    

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{{ datavault4dbt.sat_v0(parent_hashkey=metadata_dict.get(parent_hashkey),
                        src_hashdiff=metadata_dict.get(src_hashdiff),
                        source_model=metadata_dict.get(source_model),
                        src_payload=metadata_dict.get(src_payload)
                        ) }}