Skip to content

feat(ingest/sigma): emit Data Models as Containers with per-element Datasets#17173

Merged
kyungsoo-datahub merged 17 commits intomasterfrom
feat/sigma-dm-containers
Apr 27, 2026
Merged

feat(ingest/sigma): emit Data Models as Containers with per-element Datasets#17173
kyungsoo-datahub merged 17 commits intomasterfrom
feat/sigma-dm-containers

Conversation

@kyungsoo-datahub
Copy link
Copy Markdown
Contributor

Summary

Opt-in Sigma Data Model ingestion. DM → Container; each element → Dataset with SchemaMetadata and UpstreamLineage (intra-DM + external Sigma Dataset refs).
Element URNs key off the immutable DM UUID: urn:li:dataset:(sigma,<dataModelId>.<elementId>,env). Additive; no existing URNs change.

Emitted

  • Container (subtype Sigma Data Model)
  • Dataset (subtype Sigma Data Model Element) with SchemaMetadata
  • UpstreamLineage (intra-DM + external Sigma Datasets)

Config

  • ingest_data_models (default False)
  • data_model_pattern

…atasets

Opt-in (`ingest_data_models`, default False) Sigma Data Model ingestion.
Each DM becomes a Container and each element inside becomes a Dataset with
SchemaMetadata and UpstreamLineage. Lineage covers intra-DM element refs
and external upstreams (Sigma Datasets ingested in the same run).

Element Dataset URNs are keyed by the immutable Data Model UUID
(`urn:li:dataset:(sigma,<dataModelId>.<elementId>,env)`) so attachments
survive Sigma slug rotation; the slug is captured on
`customProperties.dataModelUrlId`.

Additive — no existing URNs change. Cross-DM lineage, personal-space DM
discovery, and workbook-to-DM element linking arrive in a follow-up PR.
@github-actions github-actions Bot added ingestion PR or Issue related to the ingestion of metadata docs Issues and Improvements to docs labels Apr 24, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Linear: ING-2413

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 95.01312% with 19 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...estion/src/datahub/ingestion/source/sigma/sigma.py 92.48% 10 Missing ⚠️
...src/datahub/ingestion/source/sigma/data_classes.py 90.38% 5 Missing ⚠️
...on/src/datahub/ingestion/source/sigma/sigma_api.py 98.20% 3 Missing ⚠️
...stion/src/datahub/ingestion/source/sigma/config.py 96.29% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@datahub-connector-tests
Copy link
Copy Markdown

datahub-connector-tests Bot commented Apr 24, 2026

Connector Tests Results

Connector tests failed for commit 6eceb7c

View full test logs →

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.

- _paginated_entries: per-entry try/except on model_validate so a
  single malformed element/column cannot empty the whole page list,
  and report.warning on both per-entry drop and outer pagination
  failure so operators see the issue in the ingestion report (it was
  previously only debug-logged via _log_http_error). Partial results
  collected before a mid-pagination HTTP error are preserved.
- get_data_models: delegate the outer pagination to _paginated_entries
  (remove the duplicated nextPage/nextPageToken loop and the inlined
  per-entry ValidationError handling) so future pagination fixes are
  applied in one place.
- SigmaSourceConfig: add a model_validator that warns when
  data_model_pattern is customized while ingest_data_models=False —
  the pattern is silently ignored today, which has tricked operators.
- DatasetProperties for DM elements: use description=None instead of
  "" so aspect-replace semantics do not blank out a user-edited
  description on re-ingestion.
- Document the paths[1:] drop-workspace-prefix intent in sigma.py
  and the vizualizationType API-spelling match in data_classes.py.
- Unit tests: cover the file_meta fallback branch in
  _assemble_data_model (filled-from-file, DM-payload-wins, None-safe) —
  closes the test-coverage gap called out in review.
mypy inferred the ``base`` dict literals in the new
``TestAssembleDataModelFileMetaFallback`` fixtures as ``dict[str, str]``
because every initial value was a string. ``base.update(overrides)``
then failed to type-check since ``**overrides: object`` produces a
``dict[str, object]`` that cannot widen a ``dict[str, str]``. Annotate
both ``base`` dicts explicitly as ``Dict[str, object]`` so the update
and later ``model_validate`` calls pass under ``mypy --strict``
baselines without relaxing runtime behavior.
…ination cycle guard

Addresses PR review feedback on the DM ingestion hardening.

M1 (Major): ``gen_containers`` for a Sigma DM received
``description=data_model.description or ""``, so a DM with no
description sent ``""`` to GMS and would blank out any description the
user had edited in the DataHub UI on the next ingest. Pass ``None``
through instead, matching the element-Dataset fix and the Qlik
connector pattern.

M2 (Major): ``_get_data_model_lineage_entries`` read only the first
page of ``/dataModels/{id}/lineage``. For a DM whose lineage spans
multiple pages, elements on page 2+ silently lost their upstreams
(``data_model_element_upstreams_unresolved`` did not bump because the
element's ``source_ids`` were simply absent). Split the HTTP loop out
of ``_paginated_entries`` into ``_paginated_raw_entries`` and have
``/lineage`` consume it. Expected "no lineage" statuses (400/403/404/
500) are still swallowed silently via a new ``silent_statuses``
parameter applied only to the first page.

M3 (Major): ``_paginated_raw_entries`` now tracks already-seen
``nextPage`` / ``nextPageToken`` cursors in a ``Set[str]``. A broken
Sigma proxy that echoes the same cursor on every response used to loop
forever and accumulate duplicates; it now breaks with a report
warning.

m1 (polish): DM element Datasets now carry
``externalUrl=<data_model.url>?:nodeId=<elementId>`` so users can
click from a DM-element Dataset back into Sigma. Mirrors the
deep-link shape used for workbook elements in ``get_page_elements``.

m3 (polish): ``SigmaDataModelElement._discard_api_bare_string_columns``
now emits a ``logger.debug`` when it drops bare-string columns, so an
operator investigating an empty-schema DM element can see the columns
were intentionally deferred to ``/columns``.

m4 (polish): ``TestAssembleDataModelFileMetaFallback`` replaces the
``patch.stopall()`` try/finally pattern with a nested
``with patch.object(...)`` context-manager helper, guaranteeing
cleanup even if an assertion raises mid-test.

Tests: adds ``TestPaginatedRawEntries`` covering multi-page via
``nextPage`` and ``nextPageToken``, the cycle-protection break, the
silent-status swallow, and an end-to-end multi-page ``/lineage``
regression for M2.
…h shape, report hygiene

Review feedback on the DM ingestion path:

- C1: `_paginated_entries` accepts an optional natural-key dedup and the
  DM element/column/data-model callers opt in on elementId/columnId/
  dataModelId. Raw DM lineage entries dedupe on `(type, nodeId)` inside
  `_get_data_model_lineage_entries`. Cycle-echo or server-side page
  overlap no longer double-emits MCPs or inflates counters.
- C2: `get_data_models` falls back to `data_model.workspaceId` when
  `/files` lacks a workspace row, matching how the payload is already
  honored for emission and customProperties.
- M2: Drop the trailing element-name entry from DM element
  `BrowsePathsV2` so breadcrumbs terminate at the DM Container, matching
  every other Sigma browsepath builder.
- M3: Remove 500 from `_get_data_model_lineage_entries` silent_statuses
  so a degraded Sigma region surfaces loud warnings instead of zero
  lineage with zero telemetry.
- M5: DataModelKey docstring clarifies that multi-env operators should
  use `platform_instance` to disambiguate DM Container URNs.
- M4: `sigma_pre.md` notes that external-upstream resolution requires
  the referenced Sigma Dataset to be ingested in the same run, plus the
  multi-env Container caveat.
- m2 (polish): Move the `data_model_pattern`-ignored check from the
  config-level `model_validator` into `SigmaSource.__init__` and emit
  it via `self.reporter.warning(...)` so CI / dashboards gating on
  report warnings see it.
- m7 (polish): Cap per-endpoint malformed-entry warnings at 10 while
  tallying the full drop count on `pagination_malformed_entries_dropped`
  to prevent warning-storm on a vendor-wide regression.
- New counters: `pagination_duplicate_entries_dropped`,
  `pagination_malformed_entries_dropped`.
- New tests: dedup behavior (typed + lineage), 500 surfaces warning,
  workspace-payload fallback, malformed-entry rate limit.
… test

``_log_http_error`` reads ``e.response.status_code`` off the exception,
which real ``requests`` populates on ``raise_for_status`` failures. The
unit test was constructing ``HTTPError`` directly with only a message,
so ``e.response`` was ``None`` and the test tripped an ``AttributeError``
inside the logger before the warning was recorded. Attach the mocked
500 response onto the HTTPError so the code path under test (M3:
surface a warning on /lineage 5xx) is exercised as in production.
…act_lineage gate

Review follow-ups on PR #17141 PR 1:

C1: `_get_data_model_lineage_entries` dedup key was `(type, nodeId)`,
but the real DM `/lineage` payload does not carry `nodeId` -- `element`
rows use `elementId` and `dataset`/`table` rows use `inodeId`. Under
the old key every real row collapsed to `(<type>, "")` so after the
first entry per type everything was silently dropped. Switch to a
shape-aware key and preserve rows missing the expected identifier
rather than collapsing them.

M7: Rewrote the dedup unit tests against the real DM /lineage shape
(elementId / inodeId) and added a regression for entries missing the
natural key.

Tighten test_sigma_ingest_data_models_external_dataset_not_ingested
to assert exact counter values (1 intra, 1 resolved external,
1 unresolved external, 0 unknown shape) so C1 cannot regress silently.

M1: Split `data_model_element_upstreams_unresolved` into
`_unresolved_external` (target dataset exists but was filtered out of
the run) vs `_unknown_shape` (source_id shapes this release does not
parse, e.g. cross-DM refs). Aggregate counter kept for dashboards.

M2: Docstring on `_get_data_model_lineage_entries` now reflects the
shape-aware key choice and the rationale for preserving entries with
missing identifiers.

M5: Duplicate fieldPath collapse is now deterministic -- prefer the
row with a formula set (user-authored calc field), tie-break on the
lexicographically smallest columnId. Dropped columnIds are emitted
to debug logs for reconciliation.

M6: `/dataModels/{id}/lineage` is now gated on `extract_lineage` so
`ingest_data_models=True` + `extract_lineage=False` emits DM
Containers, element Datasets, and SchemaMetadata without hitting any
lineage endpoint.

M4: Softened hard-coded `StringTypeClass` / `nativeDataType="String"`
on DM element columns to `NullTypeClass` + `"unknown"` via a named
constant, so downstream type-aware features can distinguish
"no type returned by Sigma" from "actually a string."

Doc updates: sigma_pre.md covers the new counter split, the
extract_lineage gate, and the softened column type; updating-datahub.md
bullet reflects the same.
…ource_id

Mirror the success-path dedup in `_gen_data_model_element_upstream_lineage`
so duplicate `source_ids` in a single element's vendor payload (diamond
reference to the same un-ingested `inode-X`) no longer over-count the
`data_model_element_upstreams_unresolved` and
`data_model_element_upstreams_unresolved_external` / `_unknown_shape`
buckets. Emitted aspects are unaffected (they were already deduped via
the `seen` set); only report-counter correctness changes.

Also add two integration tests:
- `test_sigma_ingest_data_models_extract_lineage_false` pins the
  `ingest_data_models=True + extract_lineage=False` opt-out (no
  `/lineage` HTTP call, Container + element Datasets + SchemaMetadata
  still emitted, no `upstreamLineage` aspects).
- `test_sigma_ingest_data_models_unresolved_counters_dedup_duplicate_source_id`
  regresses the counter dedup fix above.

Scrub a handful of code/test comments that referenced internal review-
round labels (`Review M*`, `Review C1`).
C1: Revert unintentional chart-inputs lexicographic sort in
_gen_elements_workunit. The sorted(...) slipped into this DM-focused
PR would churn every Sigma chart's ChartInfo aspect on first re-ingest
after upgrade for any tenant whose inputs were not already alphabetical.

C2: Unify data_model.workspaceId handling between filtering and
rendering in _assemble_data_model. When /files and /dataModels
disagree, DMs were filtered under the /files workspace but rendered /
counted under the /dataModels payload workspace -- a split-brain the
new resolved_workspace_id parameter closes.

M1: Emit OwnershipClass on DM element Datasets when ingest_owner is
enabled and data_model.createdBy resolves, so 'datasets owned by X'
DataHub filters no longer miss DM elements despite the author
appearing on the enclosing Container.

M2: Wrap per-DM assembly in a try/except so one malformed DM does not
abort the whole feed. Mirrors get_sigma_workbooks.

M3: Drop ignoreCase from the default-pattern detector -- a user who
pins ignoreCase: False on an otherwise-default pattern is semantically
still 'match everything' and must not trigger the 'data_model_pattern
ignored' warning.

M5: Merge sourceIds on shape-aware dedup collisions in
_get_data_model_lineage_entries rather than silently dropping the
second occurrence. Preserves all upstreams if Sigma ever splits one
element's lineage across multiple rows.

M6: Populate sigma_dataset_urn_by_url_id eagerly before DM iteration
in get_workunits_internal so DM external-upstream resolution does not
depend on pipeline generator drain order.

Tests: unit coverage for C1 (insertion-order pin),
ignoreCase-False-default non-warning, lineage sourceIds union on
collision, eager dataset-URN-map population, DM element owner
emission (3 cases), and per-DM isolation on assembly failure.
Integration coverage for C2 (mismatched /files vs /dataModels
workspaceId).
@github-actions
Copy link
Copy Markdown
Contributor

Linear: ING-2416

Two DM-ingest integration tests filtered MCEs by substring-matching the
DM urlId (CDJLIyOhUoKBSEVI8Wr4n) against each entityUrn. But DM
element URNs embed the immutable dataModelId UUID
(147a4d09-a686-4eea-b183-9b82aa0f7beb) per the URN-keying invariant
documented in updating-datahub.md -- urlId never appears in a URN.

That made the negative assertions pass tautologically (empty filter
list) and the one positive assertion in
test_sigma_ingest_data_models_extract_lineage_false fail hard (asserted
a non-empty list that was always empty):

  assert element_schema_aspects, 'DM element SchemaMetadata should ...'

Switch the substring filter to the dataModelId UUID so:
- the positive assertion actually sees the three DM element
  SchemaMetadata aspects, and
- the paired negative assertion on UpstreamLineage becomes a real
  regression guard instead of tautologically empty.

Also tighten the same filter in test_sigma_ingest_data_models_lineage_http_error
for consistency.
Two integration tests duplicate coverage already pinned at the unit
layer, where the same invariants run faster and isolate the assertion
to the exact code path under test. Dropped:

- test_sigma_ingest_data_models_next_page_token_pagination
  Pinned: get_data_models walks both pages when first page advertises
  a nextPageToken. Already covered by
  TestPaginatedRawEntries::test_collects_entries_across_pages_via_nextPageToken
  (and the cycle-guard sibling), which test the same paginator code path
  on the same _paginated_raw_entries helper that get_data_models uses.

- test_sigma_ingest_data_models_schema_duplicate_fieldpath_dropped
  Pinned: a DM element with two columns sharing the same fieldPath
  emits one SchemaMetadata field and bumps the dedup counter. Already
  covered by TestSchemaMetadataEmission::test_duplicate_fieldpath_tiebreak_*
  which assert len(fields) == 1 and the counter increment, plus the
  exact tie-break rule (formula wins, then smallest columnId).

Tests not dropped despite a partial overlap:

- test_sigma_ingest_data_models_unresolved_counters_dedup_duplicate_source_id
  exercises a *different* code path (the unresolved_seen set inside the
  per-element source_id walk in sigma.py), not the row-level dedup that
  TestLineageDedupMerge tests at the API layer. Kept.

- test_sigma_ingest_data_models_default_off pins the *default value* of
  ingest_data_models (omits the flag from config), distinct from
  test_sigma_ingest_data_models_disabled which sets it explicitly to
  False. If the default ever flips, only the former fires. Kept.

Net: -184 LOC in tests/integration/sigma/test_sigma.py with zero
coverage loss; the corresponding 9 unit tests pass locally.
@alwaysmeticulous
Copy link
Copy Markdown

alwaysmeticulous Bot commented Apr 24, 2026

🔴 Meticulous spotted visual differences in 9 of 1455 screens tested: view and approve differences detected.

Meticulous evaluated ~10 hours of user flows against your PR.

Last updated for commit b5fc763 ruff fixed. This comment will update as new commits are pushed.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 25, 2026

Bundle Report

Bundle size has no change ✅

Copy link
Copy Markdown
Contributor

@treff7es treff7es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks good. It would be nice to have the new model entities in golden tests as well

"""

def _dm(self, **overrides: object) -> SigmaDataModel:
import datetime as _dt
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imports should be at the top

# Same golden as the baseline test — the ingest_data_models=False path
# should produce identical output to the unconfigured default when no DM
# mocks are registered.
import json
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imports should be at the top

pipeline.run()
pipeline.raise_from_status()

import json
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Movet his to the top, please

f"{[r.url for r in lineage_hits]}"
)

import json
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

report = _sigma_report(pipeline)
assert report.data_models_without_workspace == 1

import json
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

pipeline.run()
pipeline.raise_from_status()

import json
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

pipeline.run()
pipeline.raise_from_status()

import json
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Callable,
Deque,
Dict,
Hashable,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Python 3.9, typing.Hashable is a deprecated alias for collections.abc.Hashable (PEP 585). Ruff rule UP035 flags this.

@maggiehays maggiehays added pending-submitter-merge and removed needs-review Label for PRs that need review from a maintainer. labels Apr 27, 2026
…s.abc.Hashable

typing.Hashable is a deprecated alias since Python 3.9 (PEP 585).
Use collections.abc.Hashable directly as flagged by ruff UP035.
Address reviewer feedback: ``import json`` and ``import datetime as _dt``
were scattered inside test functions; move them to the module-level import
block where they belong.
@kyungsoo-datahub kyungsoo-datahub merged commit 58aab9b into master Apr 27, 2026
51 of 52 checks passed
@kyungsoo-datahub kyungsoo-datahub deleted the feat/sigma-dm-containers branch April 27, 2026 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs Issues and Improvements to docs ingestion PR or Issue related to the ingestion of metadata pending-submitter-merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants