feat(ingest/sigma): emit Data Models as Containers with per-element Datasets by kyungsoo-datahub · Pull Request #17173 · datahub-project/datahub

kyungsoo-datahub · 2026-04-24T03:59:55Z

Summary

Opt-in Sigma Data Model ingestion. DM → Container; each element → Dataset with SchemaMetadata and UpstreamLineage (intra-DM + external Sigma Dataset refs).
Element URNs key off the immutable DM UUID: urn:li:dataset:(sigma,<dataModelId>.<elementId>,env). Additive; no existing URNs change.

Emitted

Container (subtype Sigma Data Model)
Dataset (subtype Sigma Data Model Element) with SchemaMetadata
UpstreamLineage (intra-DM + external Sigma Datasets)

Config

ingest_data_models (default False)
data_model_pattern

…atasets Opt-in (`ingest_data_models`, default False) Sigma Data Model ingestion. Each DM becomes a Container and each element inside becomes a Dataset with SchemaMetadata and UpstreamLineage. Lineage covers intra-DM element refs and external upstreams (Sigma Datasets ingested in the same run). Element Dataset URNs are keyed by the immutable Data Model UUID (`urn:li:dataset:(sigma,<dataModelId>.<elementId>,env)`) so attachments survive Sigma slug rotation; the slug is captured on `customProperties.dataModelUrlId`. Additive — no existing URNs change. Cross-DM lineage, personal-space DM discovery, and workbook-to-DM element linking arrive in a follow-up PR.

github-actions · 2026-04-24T04:00:06Z

Linear: ING-2413

codecov · 2026-04-24T04:03:36Z

Codecov Report

❌ Patch coverage is 95.01312% with 19 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
...estion/src/datahub/ingestion/source/sigma/sigma.py	92.48%	10 Missing ⚠️
...src/datahub/ingestion/source/sigma/data_classes.py	90.38%	5 Missing ⚠️
...on/src/datahub/ingestion/source/sigma/sigma_api.py	98.20%	3 Missing ⚠️
...stion/src/datahub/ingestion/source/sigma/config.py	96.29%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

datahub-connector-tests · 2026-04-24T04:40:58Z

Connector Tests Results

Connector tests failed for commit 6eceb7c

View full test logs →

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.

- _paginated_entries: per-entry try/except on model_validate so a single malformed element/column cannot empty the whole page list, and report.warning on both per-entry drop and outer pagination failure so operators see the issue in the ingestion report (it was previously only debug-logged via _log_http_error). Partial results collected before a mid-pagination HTTP error are preserved. - get_data_models: delegate the outer pagination to _paginated_entries (remove the duplicated nextPage/nextPageToken loop and the inlined per-entry ValidationError handling) so future pagination fixes are applied in one place. - SigmaSourceConfig: add a model_validator that warns when data_model_pattern is customized while ingest_data_models=False — the pattern is silently ignored today, which has tricked operators. - DatasetProperties for DM elements: use description=None instead of "" so aspect-replace semantics do not blank out a user-edited description on re-ingestion. - Document the paths[1:] drop-workspace-prefix intent in sigma.py and the vizualizationType API-spelling match in data_classes.py. - Unit tests: cover the file_meta fallback branch in _assemble_data_model (filled-from-file, DM-payload-wins, None-safe) — closes the test-coverage gap called out in review.

mypy inferred the ``base`` dict literals in the new ``TestAssembleDataModelFileMetaFallback`` fixtures as ``dict[str, str]`` because every initial value was a string. ``base.update(overrides)`` then failed to type-check since ``**overrides: object`` produces a ``dict[str, object]`` that cannot widen a ``dict[str, str]``. Annotate both ``base`` dicts explicitly as ``Dict[str, object]`` so the update and later ``model_validate`` calls pass under ``mypy --strict`` baselines without relaxing runtime behavior.

…ination cycle guard Addresses PR review feedback on the DM ingestion hardening. M1 (Major): ``gen_containers`` for a Sigma DM received ``description=data_model.description or ""``, so a DM with no description sent ``""`` to GMS and would blank out any description the user had edited in the DataHub UI on the next ingest. Pass ``None`` through instead, matching the element-Dataset fix and the Qlik connector pattern. M2 (Major): ``_get_data_model_lineage_entries`` read only the first page of ``/dataModels/{id}/lineage``. For a DM whose lineage spans multiple pages, elements on page 2+ silently lost their upstreams (``data_model_element_upstreams_unresolved`` did not bump because the element's ``source_ids`` were simply absent). Split the HTTP loop out of ``_paginated_entries`` into ``_paginated_raw_entries`` and have ``/lineage`` consume it. Expected "no lineage" statuses (400/403/404/ 500) are still swallowed silently via a new ``silent_statuses`` parameter applied only to the first page. M3 (Major): ``_paginated_raw_entries`` now tracks already-seen ``nextPage`` / ``nextPageToken`` cursors in a ``Set[str]``. A broken Sigma proxy that echoes the same cursor on every response used to loop forever and accumulate duplicates; it now breaks with a report warning. m1 (polish): DM element Datasets now carry ``externalUrl=<data_model.url>?:nodeId=<elementId>`` so users can click from a DM-element Dataset back into Sigma. Mirrors the deep-link shape used for workbook elements in ``get_page_elements``. m3 (polish): ``SigmaDataModelElement._discard_api_bare_string_columns`` now emits a ``logger.debug`` when it drops bare-string columns, so an operator investigating an empty-schema DM element can see the columns were intentionally deferred to ``/columns``. m4 (polish): ``TestAssembleDataModelFileMetaFallback`` replaces the ``patch.stopall()`` try/finally pattern with a nested ``with patch.object(...)`` context-manager helper, guaranteeing cleanup even if an assertion raises mid-test. Tests: adds ``TestPaginatedRawEntries`` covering multi-page via ``nextPage`` and ``nextPageToken``, the cycle-protection break, the silent-status swallow, and an end-to-end multi-page ``/lineage`` regression for M2.

…h shape, report hygiene Review feedback on the DM ingestion path: - C1: `_paginated_entries` accepts an optional natural-key dedup and the DM element/column/data-model callers opt in on elementId/columnId/ dataModelId. Raw DM lineage entries dedupe on `(type, nodeId)` inside `_get_data_model_lineage_entries`. Cycle-echo or server-side page overlap no longer double-emits MCPs or inflates counters. - C2: `get_data_models` falls back to `data_model.workspaceId` when `/files` lacks a workspace row, matching how the payload is already honored for emission and customProperties. - M2: Drop the trailing element-name entry from DM element `BrowsePathsV2` so breadcrumbs terminate at the DM Container, matching every other Sigma browsepath builder. - M3: Remove 500 from `_get_data_model_lineage_entries` silent_statuses so a degraded Sigma region surfaces loud warnings instead of zero lineage with zero telemetry. - M5: DataModelKey docstring clarifies that multi-env operators should use `platform_instance` to disambiguate DM Container URNs. - M4: `sigma_pre.md` notes that external-upstream resolution requires the referenced Sigma Dataset to be ingested in the same run, plus the multi-env Container caveat. - m2 (polish): Move the `data_model_pattern`-ignored check from the config-level `model_validator` into `SigmaSource.__init__` and emit it via `self.reporter.warning(...)` so CI / dashboards gating on report warnings see it. - m7 (polish): Cap per-endpoint malformed-entry warnings at 10 while tallying the full drop count on `pagination_malformed_entries_dropped` to prevent warning-storm on a vendor-wide regression. - New counters: `pagination_duplicate_entries_dropped`, `pagination_malformed_entries_dropped`. - New tests: dedup behavior (typed + lineage), 500 surfaces warning, workspace-payload fallback, malformed-entry rate limit.

… test ``_log_http_error`` reads ``e.response.status_code`` off the exception, which real ``requests`` populates on ``raise_for_status`` failures. The unit test was constructing ``HTTPError`` directly with only a message, so ``e.response`` was ``None`` and the test tripped an ``AttributeError`` inside the logger before the warning was recorded. Attach the mocked 500 response onto the HTTPError so the code path under test (M3: surface a warning on /lineage 5xx) is exercised as in production.

…act_lineage gate Review follow-ups on PR #17141 PR 1: C1: `_get_data_model_lineage_entries` dedup key was `(type, nodeId)`, but the real DM `/lineage` payload does not carry `nodeId` -- `element` rows use `elementId` and `dataset`/`table` rows use `inodeId`. Under the old key every real row collapsed to `(<type>, "")` so after the first entry per type everything was silently dropped. Switch to a shape-aware key and preserve rows missing the expected identifier rather than collapsing them. M7: Rewrote the dedup unit tests against the real DM /lineage shape (elementId / inodeId) and added a regression for entries missing the natural key. Tighten test_sigma_ingest_data_models_external_dataset_not_ingested to assert exact counter values (1 intra, 1 resolved external, 1 unresolved external, 0 unknown shape) so C1 cannot regress silently. M1: Split `data_model_element_upstreams_unresolved` into `_unresolved_external` (target dataset exists but was filtered out of the run) vs `_unknown_shape` (source_id shapes this release does not parse, e.g. cross-DM refs). Aggregate counter kept for dashboards. M2: Docstring on `_get_data_model_lineage_entries` now reflects the shape-aware key choice and the rationale for preserving entries with missing identifiers. M5: Duplicate fieldPath collapse is now deterministic -- prefer the row with a formula set (user-authored calc field), tie-break on the lexicographically smallest columnId. Dropped columnIds are emitted to debug logs for reconciliation. M6: `/dataModels/{id}/lineage` is now gated on `extract_lineage` so `ingest_data_models=True` + `extract_lineage=False` emits DM Containers, element Datasets, and SchemaMetadata without hitting any lineage endpoint. M4: Softened hard-coded `StringTypeClass` / `nativeDataType="String"` on DM element columns to `NullTypeClass` + `"unknown"` via a named constant, so downstream type-aware features can distinguish "no type returned by Sigma" from "actually a string." Doc updates: sigma_pre.md covers the new counter split, the extract_lineage gate, and the softened column type; updating-datahub.md bullet reflects the same.

…ource_id Mirror the success-path dedup in `_gen_data_model_element_upstream_lineage` so duplicate `source_ids` in a single element's vendor payload (diamond reference to the same un-ingested `inode-X`) no longer over-count the `data_model_element_upstreams_unresolved` and `data_model_element_upstreams_unresolved_external` / `_unknown_shape` buckets. Emitted aspects are unaffected (they were already deduped via the `seen` set); only report-counter correctness changes. Also add two integration tests: - `test_sigma_ingest_data_models_extract_lineage_false` pins the `ingest_data_models=True + extract_lineage=False` opt-out (no `/lineage` HTTP call, Container + element Datasets + SchemaMetadata still emitted, no `upstreamLineage` aspects). - `test_sigma_ingest_data_models_unresolved_counters_dedup_duplicate_source_id` regresses the counter dedup fix above. Scrub a handful of code/test comments that referenced internal review- round labels (`Review M*`, `Review C1`).

C1: Revert unintentional chart-inputs lexicographic sort in _gen_elements_workunit. The sorted(...) slipped into this DM-focused PR would churn every Sigma chart's ChartInfo aspect on first re-ingest after upgrade for any tenant whose inputs were not already alphabetical. C2: Unify data_model.workspaceId handling between filtering and rendering in _assemble_data_model. When /files and /dataModels disagree, DMs were filtered under the /files workspace but rendered / counted under the /dataModels payload workspace -- a split-brain the new resolved_workspace_id parameter closes. M1: Emit OwnershipClass on DM element Datasets when ingest_owner is enabled and data_model.createdBy resolves, so 'datasets owned by X' DataHub filters no longer miss DM elements despite the author appearing on the enclosing Container. M2: Wrap per-DM assembly in a try/except so one malformed DM does not abort the whole feed. Mirrors get_sigma_workbooks. M3: Drop ignoreCase from the default-pattern detector -- a user who pins ignoreCase: False on an otherwise-default pattern is semantically still 'match everything' and must not trigger the 'data_model_pattern ignored' warning. M5: Merge sourceIds on shape-aware dedup collisions in _get_data_model_lineage_entries rather than silently dropping the second occurrence. Preserves all upstreams if Sigma ever splits one element's lineage across multiple rows. M6: Populate sigma_dataset_urn_by_url_id eagerly before DM iteration in get_workunits_internal so DM external-upstream resolution does not depend on pipeline generator drain order. Tests: unit coverage for C1 (insertion-order pin), ignoreCase-False-default non-warning, lineage sourceIds union on collision, eager dataset-URN-map population, DM element owner emission (3 cases), and per-DM isolation on assembly failure. Integration coverage for C2 (mismatched /files vs /dataModels workspaceId).

github-actions · 2026-04-24T07:53:01Z

Linear: ING-2416

Two DM-ingest integration tests filtered MCEs by substring-matching the DM urlId (CDJLIyOhUoKBSEVI8Wr4n) against each entityUrn. But DM element URNs embed the immutable dataModelId UUID (147a4d09-a686-4eea-b183-9b82aa0f7beb) per the URN-keying invariant documented in updating-datahub.md -- urlId never appears in a URN. That made the negative assertions pass tautologically (empty filter list) and the one positive assertion in test_sigma_ingest_data_models_extract_lineage_false fail hard (asserted a non-empty list that was always empty): assert element_schema_aspects, 'DM element SchemaMetadata should ...' Switch the substring filter to the dataModelId UUID so: - the positive assertion actually sees the three DM element SchemaMetadata aspects, and - the paired negative assertion on UpstreamLineage becomes a real regression guard instead of tautologically empty. Also tighten the same filter in test_sigma_ingest_data_models_lineage_http_error for consistency.

Two integration tests duplicate coverage already pinned at the unit layer, where the same invariants run faster and isolate the assertion to the exact code path under test. Dropped: - test_sigma_ingest_data_models_next_page_token_pagination Pinned: get_data_models walks both pages when first page advertises a nextPageToken. Already covered by TestPaginatedRawEntries::test_collects_entries_across_pages_via_nextPageToken (and the cycle-guard sibling), which test the same paginator code path on the same _paginated_raw_entries helper that get_data_models uses. - test_sigma_ingest_data_models_schema_duplicate_fieldpath_dropped Pinned: a DM element with two columns sharing the same fieldPath emits one SchemaMetadata field and bumps the dedup counter. Already covered by TestSchemaMetadataEmission::test_duplicate_fieldpath_tiebreak_* which assert len(fields) == 1 and the counter increment, plus the exact tie-break rule (formula wins, then smallest columnId). Tests not dropped despite a partial overlap: - test_sigma_ingest_data_models_unresolved_counters_dedup_duplicate_source_id exercises a *different* code path (the unresolved_seen set inside the per-element source_id walk in sigma.py), not the row-level dedup that TestLineageDedupMerge tests at the API layer. Kept. - test_sigma_ingest_data_models_default_off pins the *default value* of ingest_data_models (omits the flag from config), distinct from test_sigma_ingest_data_models_disabled which sets it explicitly to False. If the default ever flips, only the former fires. Kept. Net: -184 LOC in tests/integration/sigma/test_sigma.py with zero coverage loss; the corresponding 9 unit tests pass locally.

alwaysmeticulous · 2026-04-24T23:44:02Z

🔴 Meticulous spotted visual differences in 9 of 1455 screens tested: view and approve differences detected.

Meticulous evaluated ~10 hours of user flows against your PR.

_{Last updated for commit b5fc763 ruff fixed. This comment will update as new commits are pushed.}

codecov · 2026-04-25T04:02:38Z

Bundle Report

Bundle size has no change ✅

treff7es

Overall it looks good. It would be nice to have the new model entities in golden tests as well

treff7es · 2026-04-27T07:06:03Z

+    """
+
+    def _dm(self, **overrides: object) -> SigmaDataModel:
+        import datetime as _dt


Imports should be at the top

treff7es · 2026-04-27T07:06:39Z

+    # Same golden as the baseline test — the ingest_data_models=False path
+    # should produce identical output to the unconfigured default when no DM
+    # mocks are registered.
+    import json


Imports should be at the top

treff7es · 2026-04-27T07:06:59Z

+    pipeline.run()
+    pipeline.raise_from_status()
+
+    import json


Movet his to the top, please

treff7es · 2026-04-27T07:07:09Z

+        f"{[r.url for r in lineage_hits]}"
+    )
+
+    import json


treff7es · 2026-04-27T07:07:17Z

+    report = _sigma_report(pipeline)
+    assert report.data_models_without_workspace == 1
+
+    import json


treff7es · 2026-04-27T07:07:22Z

+    pipeline.run()
+    pipeline.raise_from_status()
+
+    import json


treff7es · 2026-04-27T07:07:27Z

+    pipeline.run()
+    pipeline.raise_from_status()
+
+    import json


treff7es · 2026-04-27T07:09:04Z

+    Callable,
+    Deque,
+    Dict,
+    Hashable,


Since Python 3.9, typing.Hashable is a deprecated alias for collections.abc.Hashable (PEP 585). Ruff rule UP035 flags this.

…s.abc.Hashable typing.Hashable is a deprecated alias since Python 3.9 (PEP 585). Use collections.abc.Hashable directly as flagged by ruff UP035.

Address reviewer feedback: ``import json`` and ``import datetime as _dt`` were scattered inside test functions; move them to the module-level import block where they belong.

github-actions Bot added ingestion PR or Issue related to the ingestion of metadata docs Issues and Improvements to docs labels Apr 24, 2026

github-actions Bot deployed to datahub-wheels (Preview) April 24, 2026 04:01 View deployment

style: ruff format test_sigma.py

9a0be9a

github-actions Bot deployed to datahub-wheels (Preview) April 24, 2026 04:06 View deployment

kyungsoo-datahub marked this pull request as draft April 24, 2026 04:16

vercel Bot deployed to Preview April 24, 2026 04:19 View deployment

kyungsoo-datahub force-pushed the feat/sigma-dm-containers branch from ff9f727 to 9a0be9a Compare April 24, 2026 04:54

github-actions Bot deployed to datahub-wheels (Preview) April 24, 2026 04:57 View deployment

github-actions Bot deployed to datahub-wheels (Preview) April 24, 2026 05:06 View deployment

github-actions Bot deployed to datahub-wheels (Preview) April 24, 2026 05:19 View deployment

github-actions Bot deployed to datahub-wheels (Preview) April 24, 2026 05:28 View deployment

vercel Bot deployed to Preview April 24, 2026 05:41 View deployment

github-actions Bot deployed to datahub-wheels (Preview) April 24, 2026 05:54 View deployment

vercel Bot deployed to Preview April 24, 2026 06:08 View deployment

github-actions Bot deployed to datahub-wheels (Preview) April 24, 2026 06:10 View deployment

github-actions Bot deployed to datahub-wheels (Preview) April 24, 2026 06:22 View deployment

vercel Bot deployed to Preview April 24, 2026 06:34 View deployment

github-actions Bot deployed to datahub-wheels (Preview) April 24, 2026 06:38 View deployment

github-actions Bot deployed to datahub-wheels (Preview) April 24, 2026 07:26 View deployment

vercel Bot deployed to Preview April 24, 2026 07:38 View deployment

kyungsoo-datahub marked this pull request as ready for review April 24, 2026 07:52

github-actions Bot deployed to datahub-wheels (Preview) April 24, 2026 07:57 View deployment

vercel Bot deployed to Preview April 24, 2026 08:14 View deployment

maggiehays added the needs-review Label for PRs that need review from a maintainer. label Apr 24, 2026

github-actions Bot deployed to datahub-wheels (Preview) April 24, 2026 23:41 View deployment

github-actions Bot deployed to datahub-project-web-react (Preview) April 24, 2026 23:43 View deployment

vercel Bot deployed to Preview April 24, 2026 23:56 View deployment

treff7es approved these changes Apr 27, 2026

View reviewed changes

maggiehays added pending-submitter-merge and removed needs-review Label for PRs that need review from a maintainer. labels Apr 27, 2026

kyungsoo-datahub added 2 commits April 27, 2026 06:28

fix(ingest/sigma): replace deprecated typing.Hashable with collection…

bdcd660

…s.abc.Hashable typing.Hashable is a deprecated alias since Python 3.9 (PEP 585). Use collections.abc.Hashable directly as flagged by ruff UP035.

fix(ingest/sigma): move inline imports to module top in DM tests

0d07704

Address reviewer feedback: ``import json`` and ``import datetime as _dt`` were scattered inside test functions; move them to the module-level import block where they belong.

github-actions Bot deployed to datahub-wheels (Preview) April 27, 2026 13:30 View deployment

github-actions Bot deployed to datahub-project-web-react (Preview) April 27, 2026 13:33 View deployment

ruff fixed

b5fc763

github-actions Bot deployed to datahub-wheels (Preview) April 27, 2026 13:40 View deployment

github-actions Bot deployed to datahub-project-web-react (Preview) April 27, 2026 13:43 View deployment

Merge branch 'master' into feat/sigma-dm-containers

6eceb7c

github-actions Bot deployed to datahub-wheels (Preview) April 27, 2026 13:48 View deployment

vercel Bot deployed to Preview April 27, 2026 14:01 View deployment

kyungsoo-datahub merged commit 58aab9b into master Apr 27, 2026
51 of 52 checks passed

kyungsoo-datahub deleted the feat/sigma-dm-containers branch April 27, 2026 14:20

Conversation

kyungsoo-datahub commented Apr 24, 2026

Summary

Emitted

Config

Uh oh!

github-actions Bot commented Apr 24, 2026

Uh oh!

codecov Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

datahub-connector-tests Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Connector Tests Results

Uh oh!

github-actions Bot commented Apr 24, 2026

Uh oh!

alwaysmeticulous Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented Apr 25, 2026

Bundle Report

Uh oh!

treff7es left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov Bot commented Apr 24, 2026 •

edited

Loading

datahub-connector-tests Bot commented Apr 24, 2026 •

edited

Loading

alwaysmeticulous Bot commented Apr 24, 2026 •

edited

Loading