Deduplication¶

How Dango handles duplicate records across four deduplication strategies. Choose the right strategy based on whether your data changes over time and whether you need to track that history.

Which Strategy Should I Use?¶

flowchart TD
    A["Does the source data<br/>change over time?"] -->|No| B["Are records<br/>append-only events?"]
    A -->|Yes| C["Do you need to track<br/>change history?"]

    B -->|"Yes (event logs, transactions)"| D["**append_only**<br/>Add-only, never update"]
    B -->|"No (static reference data)"| E["**none**<br/>No deduplication"]

    C -->|"Yes (audit trail needed)"| F["**scd_type2**<br/>Full change history<br/>via dbt snapshots"]
    C -->|"No (latest value is enough)"| G["**latest_only**<br/>Keep newest per key<br/>(SCD Type 1)"]

    style D fill:#e8f5e9
    style E fill:#fff3e0
    style F fill:#e1f5fe
    style G fill:#f3e5f5

Strategy Overview¶

Strategy	Duplicates	History	Best For	Complexity
`none`	Kept as-is	All records preserved	Fact tables, event logs	Lowest
`latest_only`	Replaced with newest	No history (SCD Type 1)	CRM contacts, product catalogs	Low
`append_only`	Never updated	Immutable record	Transactions, audit logs	Low
`scd_type2`	Versioned with timestamps	Full change history	Price tracking, status changes	Medium

Strategies in Detail¶

`none` — No Deduplication¶

All records are kept exactly as loaded. No deduplication logic is applied.

sources:
  - name: event_logs
    type: local_files
    local_files:
      directory: data/uploads/event_logs
    deduplication: none

When to use:

Event/fact tables where every record is unique by nature
Append-only logs where duplicates don't occur
Sources where you handle deduplication in custom dbt models

What happens: Data is loaded directly into the raw layer. The staging model passes records through without modification.

`latest_only` — Keep Newest Per Key (SCD Type 1)¶

When multiple records share the same primary key, only the most recent version is kept. This is the default for Google Sheets sources.

sources:
  - name: customer_list
    type: google_sheets
    deduplication: latest_only

When to use:

Dimension tables that get overwritten with current values
Google Sheets used as a "current state" reference
Sources where you only care about the latest version of each record

What happens: The staging model includes a ROW_NUMBER() window function partitioned by the primary key, ordered by the update timestamp descending. Only the first row (most recent) per key is kept.

Requirements: The source must have both a primary key column (e.g., id) and an update timestamp column (e.g., updated_at).

`append_only` — Add-Only, Never Update¶

Records are only ever inserted, never updated or deleted. Each sync appends new records without modifying existing ones.

sources:
  - name: transactions
    type: stripe
    deduplication: append_only

When to use:

Immutable event streams (payments, page views, sensor readings)
Data where historical accuracy is critical and records should never change
Sources that naturally produce append-only data

What happens: The staging model passes all records through. dlt's incremental loading ensures only new records are fetched on each sync.

`scd_type2` — Full Change History via dbt Snapshots¶

Tracks every change to a record over time using dbt snapshots. Each version of a record gets dbt_valid_from and dbt_valid_to timestamps showing when that version was active.

sources:
  - name: product_prices
    type: dlt_native
    deduplication: scd_type2

When to use:

Price tracking (what was the price on a given date?)
Status tracking (when did this order move to "shipped"?)
Compliance/audit requirements (prove what the data looked like at a point in time)

What happens: Dango generates a dbt snapshot file that compares the current source data against the previous snapshot. Changed records get a new version row, and the old version's dbt_valid_to is set to the current timestamp.

Generated columns:

Column	Description
`dbt_valid_from`	When this version of the record became active
`dbt_valid_to`	When this version was superseded (NULL = current)
`dbt_scd_id`	Unique identifier for this version
`dbt_updated_at`	When the snapshot was executed

Query example — get the current version:

SELECT *
FROM snapshots.snp_product_prices
WHERE dbt_valid_to IS NULL;

Query example — get the value at a specific date:

SELECT *
FROM snapshots.snp_product_prices
WHERE dbt_valid_from <= '2026-03-15'
  AND (dbt_valid_to > '2026-03-15' OR dbt_valid_to IS NULL);

Snapshot configuration details

The generated snapshot uses dbt's timestamp strategy by default, requiring a unique_key and updated_at column. If no timestamp column is available, the check strategy is used instead (compares all columns for changes).

Snapshots are stored in the snapshots schema in DuckDB and run automatically during dango sync as part of the dbt execution.

See Snapshots (SCD Type 2) for the full dbt snapshot guide.

Configuration¶

Set the deduplication strategy in your sources.yml:

sources:
  - name: my_source
    type: stripe
    deduplication: latest_only  # none | latest_only | append_only | scd_type2

If no deduplication field is specified, Dango uses sensible defaults:

Google Sheets: latest_only (sheets are typically current-state reference data)
All other sources: No deduplication applied (data loaded as-is)

Auto-Inference¶

When generating staging models, Dango can automatically infer a deduplication approach based on column patterns:

If the table has an updated_at or modified_at column and an ID column (id, uuid, key), the generated staging model includes a ROW_NUMBER() window function to keep only the most recent record per key — equivalent to the latest_only strategy
Otherwise, no deduplication is applied in the staging model

This auto-inference only affects auto-generated staging models. You can always override it by setting deduplication explicitly in sources.yml or editing the generated staging SQL directly.

Common Patterns¶

Which strategy for which source?¶

Source Type	Recommended Strategy	Reason
Google Sheets	`latest_only` (default)	Sheets are living documents — you want current state
Stripe charges	`append_only`	Charges are immutable events
HubSpot contacts	`latest_only`	Contact records get updated frequently
Product prices	`scd_type2`	Track price changes over time
Server logs	`none`	Each log entry is unique
CSV exports	`none` or `latest_only`	Depends on whether exports contain updates

Changing strategies¶

You can change a source's deduplication strategy at any time:

sources:
  - name: contacts
    type: hubspot
    deduplication: scd_type2  # Changed from latest_only

After changing to scd_type2, run a full refresh to initialize the snapshot baseline:

dango sync contacts --full-refresh

Changing from scd_type2 to another strategy

Switching away from scd_type2 does not delete the snapshot table. The snapshot stops receiving updates, but existing history is preserved in the snapshots schema.

Strategy Comparison¶

Aspect	`none`	`latest_only`	`append_only`	`scd_type2`
Duplicate handling	Kept	Replaced	Prevented by cursor	Versioned
Historical data	All records	Latest only	All records	All versions
Storage growth	Linear	Bounded	Linear	Higher (versions)
Query complexity	Simple	Simple	Simple	Moderate
dbt model type	Staging	Staging (window fn)	Staging	Snapshot
Requires primary key	No	Yes	No	Yes
Requires timestamp	No	Yes	No	Recommended

Sync Modes — incremental vs full refresh and how data is loaded
Staging Models — how deduplication is applied in auto-generated models
Snapshots (SCD Type 2) — full guide to dbt snapshots
Adding Sources — set up a source and choose its deduplication strategy

Deduplication¶

Which Strategy Should I Use?¶

Strategy Overview¶

Strategies in Detail¶

none — No Deduplication¶

latest_only — Keep Newest Per Key (SCD Type 1)¶

append_only — Add-Only, Never Update¶

scd_type2 — Full Change History via dbt Snapshots¶

Configuration¶

Auto-Inference¶

Common Patterns¶

Which strategy for which source?¶

Changing strategies¶

Strategy Comparison¶

Related Pages¶

`none` — No Deduplication¶

`latest_only` — Keep Newest Per Key (SCD Type 1)¶

`append_only` — Add-Only, Never Update¶

`scd_type2` — Full Change History via dbt Snapshots¶