feat: add scope parameter to insert-only merge strategy (#3827) #3842
Open
mattiasthalen wants to merge 16 commits into dlt-hub:devel
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…config Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ge SQL Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Also fixes the SQL generation for scope=previous_load: replaces the MERGE ON subquery approach (rejected by DuckDB) with INSERT ... SELECT WHERE NOT EXISTS, and uses MAX(load_id) from the _dlt_loads table instead of MAX(_dlt_load_id) from the target table, so the previous load is identified correctly even when it produced no rows. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-only scope Replace INSERT...WHERE NOT EXISTS with a two-step approach when scope is previous_load: DELETE from staging rows already in the previous load, then MERGE ON FALSE so all remaining staging rows are unconditionally inserted. Properly escapes load_id and status column names via escape_column_id. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When strategy is insert-only and x-insert-only-scope is previous_load, pre-filter source Arrow data to exclude rows whose primary keys already exist in the previous load before upserting. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds `scope="previous_load"` support to the delta table merge strategy. Before merging, source rows whose PKs exist in the most recently completed load's target rows are filtered out; the remaining rows are force-inserted using a `MERGE ON FALSE` equivalent, mirroring the SQL implementation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…filter Move _get_previous_load_id to shared TableFormatLoadFilesystemJob. Iceberg scope uses append after pre-filtering instead of upsert which would match against all history. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove MAX(_dlt_load_id) fallback in iceberg filter — contradicts decision to use _dlt_loads table. Use null byte separator for composite key concatenation to prevent collision. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Nested tables lack _dlt_load_id — join through root table via _dlt_root_id to filter by previous load. Guard against non-numeric load_ids in filesystem _get_previous_load_id. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…scope Replace hardcoded _dlt_load_id string with C_DLT_LOAD_ID in iceberg filter. Add docstring noting scope filters only run for root tables. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace ON FALSE with ON 1 = 0 for T-SQL destinations. Replace tuple IN with EXISTS for composite key DELETE. Raise error when scope=previous_load used with LanceDB (unsupported). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n time Invalid scope values like scope='prev_load' were silently accepted and caused all loaders to fall back to all-history behavior. Now raises ValueErrorWithKnownValues with the list of valid scopes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Child tables in delta/iceberg now join through the root table's _dlt_load_id to determine which rows belong to the previous load, matching the SQL implementation's behavior. The filesystem's prepare_load_table propagates x-insert-only-scope from root to child table schemas, and the filter functions use the root table's _dlt_id to resolve the previous load window for child rows. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
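For reviewers, the two-step SQL approach described in the commits above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it uses the standard-library `sqlite3` module, simplified table names (`loads`, `target`, `staging`), and a plain `INSERT ... SELECT` as a stand-in for `MERGE ... ON FALSE`, since SQLite has no `MERGE`. The helper names `make_db` and `run_load` and the `status = 0` convention for a completed load are assumptions made for the example.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with simplified, hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE loads (load_id TEXT, status INTEGER);
        CREATE TABLE target (id TEXT, value TEXT, load_id TEXT);
        CREATE TABLE staging (id TEXT, value TEXT);
        """
    )
    return conn


def run_load(conn: sqlite3.Connection, rows, load_id: str) -> None:
    """Insert-only load that dedups keys against the previous load only."""
    cur = conn.cursor()
    # Take the previous load from the loads table, not from the target
    # table, so a load that produced no rows still counts as "previous".
    prev = cur.execute(
        "SELECT MAX(load_id) FROM loads WHERE status = 0"
    ).fetchone()[0]

    cur.execute("DELETE FROM staging")
    cur.executemany("INSERT INTO staging (id, value) VALUES (?, ?)", rows)

    # Step 1: drop staging rows whose key already exists in the previous load.
    # EXISTS is used instead of a tuple IN so composite keys would also work.
    cur.execute(
        """
        DELETE FROM staging
        WHERE EXISTS (
            SELECT 1 FROM target t
            WHERE t.id = staging.id AND t.load_id = ?
        )
        """,
        (prev,),
    )

    # Step 2: insert everything that is left, unconditionally
    # (stand-in for MERGE ... ON FALSE / ON 1 = 0).
    cur.execute(
        "INSERT INTO target (id, value, load_id) SELECT id, value, ? FROM staging",
        (load_id,),
    )
    cur.execute("INSERT INTO loads (load_id, status) VALUES (?, 0)", (load_id,))
    conn.commit()
```

Replaying loads that carry key `A`, then `B`, then `A` against this sketch re-inserts `A` on the third load, because only the second load (containing `B`) is consulted.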
Description
Add an optional `scope` parameter to the `insert-only` merge strategy that restricts key matching to the most recent load instead of all history.

What it does:
- `scope=None` (default): current behavior, dedup keys against the entire destination table
- `scope="previous_load"`: dedup keys against only the most recent `_dlt_load_id`, allowing re-insertion of records that reappear after being absent (A→B→A pattern)

Implementation across backends:
- … `MERGE ON FALSE` to insert remaining rows. Nested tables join through the root table since they lack `_dlt_load_id`.
- … `_filter_by_previous_load` compares source Arrow data against target rows from the previous load. Child tables are resolved via the root table's `_dlt_id`/`_dlt_load_id`.
- … `EqualTo` filter. Child tables are resolved via the root table.
- … `DestinationTerminalException` (not supported).

Validation: invalid scope values (e.g. `scope='prev_load'`) are rejected at resource definition time via `ValueErrorWithKnownValues`.

Related Issues
Additional Context
Architecture docs (in dlt-sandbox):
Test coverage (16 tests, all passing):
- `scope=previous_load` (duckdb, delta, iceberg)

Tests run locally and pass before submitting.
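To make the difference between the two scopes concrete, here is a small pure-Python model of the semantics described above. It is only a sketch under stated assumptions: `insert_only`, `replay`, and `VALID_SCOPES` are hypothetical names invented for this example, rows are plain dicts, and a built-in `ValueError` stands in for `ValueErrorWithKnownValues`.

```python
from typing import Dict, List, Optional

Row = Dict[str, object]

# Hypothetical list of accepted values, mirroring the PR description.
VALID_SCOPES = (None, "previous_load")


def insert_only(
    target: List[Row],
    incoming: List[Row],
    key: str,
    load_id: str,
    previous_load_id: Optional[str],
    scope: Optional[str] = None,
) -> List[Row]:
    """Append incoming rows whose key is not already present in the scope."""
    if scope not in VALID_SCOPES:
        # Reject typos such as 'prev_load' instead of silently falling back.
        raise ValueError(f"Invalid scope {scope!r}; expected one of {VALID_SCOPES}")

    if scope == "previous_load":
        # Only keys written by the previous load block an insert.
        seen = {r[key] for r in target if r["_dlt_load_id"] == previous_load_id}
    else:
        # Default: keys anywhere in the destination table block an insert.
        seen = {r[key] for r in target}

    new_rows = [dict(r, _dlt_load_id=load_id) for r in incoming if r[key] not in seen]
    target.extend(new_rows)
    return new_rows


def replay(scope: Optional[str]) -> List[object]:
    """Run three loads carrying keys A, B, A and return the stored keys."""
    target: List[Row] = []
    previous: Optional[str] = None
    for load_id, k in [("1", "A"), ("2", "B"), ("3", "A")]:
        insert_only(target, [{"id": k}], "id", load_id, previous, scope)
        previous = load_id
    return [r["id"] for r in target]
```

With the default scope, `replay(None)` keeps `A` and `B` once each; with `replay("previous_load")` the second `A` is stored again, because the third load is compared only against the second.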