feat: add scope parameter to insert-only merge strategy (#3827) #3842

Open
mattiasthalen wants to merge 16 commits into dlt-hub:devel from mattiasthalen:feat/3827-insert-only-scope

Conversation

@mattiasthalen
Contributor

Description

Add an optional scope parameter to the insert-only merge strategy that restricts key matching to the most recent load instead of all history.

@dlt.resource(
    write_disposition={
        "disposition": "merge",
        "strategy": "insert-only",
        "scope": "previous_load",
    }
)

What it does:

  • scope=None (default): current behavior — dedup keys against entire destination table
  • scope="previous_load": dedup keys against only the most recent _dlt_load_id, allowing re-insertion of records that reappear after being absent (A→B→A pattern)
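The difference between the two scopes can be sketched in plain Python (illustrative only; `filter_new_rows` and the history shape are not dlt internals):

```python
# Illustrative sketch (not dlt internals): how scope changes which keys
# block re-insertion when deduplicating an incoming batch.

def filter_new_rows(incoming, history, scope=None):
    """history: list of (load_id, key_set) tuples, oldest load first."""
    if scope == "previous_load":
        # Only keys from the most recent load are considered "seen".
        seen = history[-1][1] if history else set()
    else:
        # Default: keys anywhere in the table's history are "seen".
        seen = set().union(*(keys for _, keys in history)) if history else set()
    return [key for key in incoming if key not in seen]

# A -> B -> A: key "a" loaded, then absent, then reappears in the new batch.
history = [("1001", {"a"}), ("1002", {"b"})]
print(filter_new_rows(["a"], history))                         # []
print(filter_new_rows(["a"], history, scope="previous_load"))  # ["a"]
```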

Implementation across backends:

  • SQL destinations: pre-filter staging to remove keys present in previous load, then MERGE ON FALSE to insert remaining rows. Nested tables join through root table since they lack _dlt_load_id.
  • Delta: _filter_by_previous_load compares source Arrow data against target rows from the previous load. Child tables resolved via root table's _dlt_id/_dlt_load_id.
  • Iceberg: same approach using pyiceberg scan with EqualTo filter. Child tables resolved via root table.
  • LanceDB: raises DestinationTerminalException (not supported).
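For the SQL path, the two-step sequence described above can be sketched as a pair of generated statements (a simplified, single-key sketch with unescaped, illustrative identifiers; the real implementation escapes names, handles composite keys via EXISTS, and joins nested tables through the root table):

```python
# Hedged sketch of the two-step SQL for scope="previous_load":
# step 1 deletes staging rows whose key already appeared in the previous
# load; step 2 inserts everything that remains via a never-matching MERGE.

def previous_load_merge_sql(staging, target, key):
    delete_sql = (
        f"DELETE FROM {staging} AS s WHERE EXISTS ("
        f"SELECT 1 FROM {target} AS t "
        f"WHERE t.{key} = s.{key} AND t._dlt_load_id = "
        f"(SELECT MAX(load_id) FROM _dlt_loads))"
    )
    # ON 1 = 0 never matches, so every remaining staging row hits the
    # WHEN NOT MATCHED branch (T-SQL-compatible spelling of ON FALSE).
    merge_sql = (
        f"MERGE INTO {target} AS t USING {staging} AS s ON 1 = 0 "
        f"WHEN NOT MATCHED THEN INSERT ({key}) VALUES (s.{key})"
    )
    return delete_sql, merge_sql
```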

Validation: invalid scope values (e.g. scope='prev_load') are rejected at resource definition time via ValueErrorWithKnownValues.
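A minimal sketch of that check (plain `ValueError` stands in for dlt's `ValueErrorWithKnownValues`):

```python
# Illustrative definition-time validation for the scope parameter.
VALID_SCOPES = (None, "previous_load")

def validate_scope(scope):
    if scope not in VALID_SCOPES:
        raise ValueError(
            f"Invalid scope {scope!r}; known values are {VALID_SCOPES}"
        )
    return scope

validate_scope("previous_load")  # accepted
# validate_scope("prev_load")    # raises ValueError
```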

Related Issues

Additional Context

Architecture docs (in dlt-sandbox):

Test coverage (16 tests, all passing):

  • A→B→A pattern with scope=previous_load (duckdb, delta, iceberg)
  • Default scope unchanged (backward compat)
  • Empty target table
  • Nested tables with scope (duckdb, delta, iceberg)
  • Hard delete + scope
  • Invalid scope validation
  • Valid scope acceptance

Tests run locally and pass before submitting.

mattiasthalen and others added 16 commits April 8, 2026 21:26
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ge SQL

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Also fixes the SQL generation for scope=previous_load: replaces the
MERGE ON subquery approach (rejected by DuckDB) with INSERT ... SELECT
WHERE NOT EXISTS, and uses MAX(load_id) from the _dlt_loads table
instead of MAX(_dlt_load_id) from the target table, so the previous
load is identified correctly even when it produced no rows.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-only scope

Replace INSERT...WHERE NOT EXISTS with a two-step approach when scope is
previous_load: DELETE from staging rows already in the previous load, then
MERGE ON FALSE so all remaining staging rows are unconditionally inserted.
Properly escapes load_id and status column names via escape_column_id.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When strategy is insert-only and x-insert-only-scope is previous_load,
pre-filter source Arrow data to exclude rows whose primary keys already
exist in the previous load before upserting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds `scope="previous_load"` support to the delta table merge strategy.
Before merging, source rows whose PKs exist in the most recently completed
load's target rows are filtered out; the remaining rows are force-inserted
using a `MERGE ON FALSE` equivalent, mirroring the SQL implementation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…filter

Move _get_previous_load_id to shared TableFormatLoadFilesystemJob.
Iceberg scope uses append after pre-filtering instead of upsert,
which would match against all history.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove MAX(_dlt_load_id) fallback in iceberg filter — contradicts
decision to use _dlt_loads table. Use null byte separator for
composite key concatenation to prevent collision.
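The collision the null byte prevents is easy to demonstrate (a sketch; the helper name is illustrative, the separator choice mirrors the commit):

```python
# Naive concatenation makes distinct composite keys collide; a null byte
# separator keeps them apart because it cannot occur inside column text.

def composite_key(values, sep="\x00"):
    return sep.join(values)

assert "ab" + "c" == "a" + "bc"                            # collision
assert composite_key(["ab", "c"]) != composite_key(["a", "bc"])
```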

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Nested tables lack _dlt_load_id — join through root table via
_dlt_root_id to filter by previous load. Guard against non-numeric
load_ids in filesystem _get_previous_load_id.
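The root-table join can be sketched in plain Python (illustrative row shapes and function name; not dlt internals):

```python
# Child rows have no _dlt_load_id; resolve previous-load membership via
# the root table: root _dlt_id values from the previous load identify
# which child rows (via _dlt_root_id) belong to that load.

def child_rows_in_previous_load(child_rows, root_rows, previous_load_id):
    root_ids = {
        r["_dlt_id"]
        for r in root_rows
        if r["_dlt_load_id"] == previous_load_id
    }
    return [c for c in child_rows if c["_dlt_root_id"] in root_ids]

roots = [
    {"_dlt_id": "r1", "_dlt_load_id": "1001"},
    {"_dlt_id": "r2", "_dlt_load_id": "1002"},
]
children = [{"_dlt_root_id": "r1"}, {"_dlt_root_id": "r2"}]
print(child_rows_in_previous_load(children, roots, "1002"))
# [{'_dlt_root_id': 'r2'}]
```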

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…scope

Replace hardcoded _dlt_load_id string with C_DLT_LOAD_ID in iceberg
filter. Add docstring noting scope filters only run for root tables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace ON FALSE with ON 1 = 0 for T-SQL destinations. Replace
tuple IN with EXISTS for composite key DELETE. Raise error when
scope=previous_load used with LanceDB (unsupported).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n time

Invalid scope values like scope='prev_load' were silently accepted and
caused all loaders to fall back to all-history behavior. Now raises
ValueErrorWithKnownValues with the list of valid scopes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Child tables in delta/iceberg now join through the root table's
_dlt_load_id to determine which rows belong to the previous load,
matching the SQL implementation's behavior. The filesystem's
prepare_load_table propagates x-insert-only-scope from root to child
table schemas, and the filter functions use the root table's _dlt_id
to resolve the previous load window for child rows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mattiasthalen mattiasthalen marked this pull request as ready for review April 9, 2026 10:29


Development

Successfully merging this pull request may close these issues.

Snapshot-diff ingestion for append-only data warehousing
