feat(ingest/sigma): emit Data Models as Containers with per-element Datasets#17173
feat(ingest/sigma): emit Data Models as Containers with per-element Datasets#17173kyungsoo-datahub merged 17 commits intomasterfrom
Conversation
…atasets Opt-in (`ingest_data_models`, default False) Sigma Data Model ingestion. Each DM becomes a Container and each element inside becomes a Dataset with SchemaMetadata and UpstreamLineage. Lineage covers intra-DM element refs and external upstreams (Sigma Datasets ingested in the same run). Element Dataset URNs are keyed by the immutable Data Model UUID (`urn:li:dataset:(sigma,<dataModelId>.<elementId>,env)`) so attachments survive Sigma slug rotation; the slug is captured on `customProperties.dataModelUrlId`. Additive — no existing URNs change. Cross-DM lineage, personal-space DM discovery, and workbook-to-DM element linking arrive in a follow-up PR.
|
Linear: ING-2413 |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Connector Tests ResultsConnector tests failed for commit To skip connector tests, add the Autogenerated by the connector-tests CI pipeline. |
ff9f727 to
9a0be9a
Compare
- _paginated_entries: per-entry try/except on model_validate so a single malformed element/column cannot empty the whole page list, and report.warning on both per-entry drop and outer pagination failure so operators see the issue in the ingestion report (it was previously only debug-logged via _log_http_error). Partial results collected before a mid-pagination HTTP error are preserved. - get_data_models: delegate the outer pagination to _paginated_entries (remove the duplicated nextPage/nextPageToken loop and the inlined per-entry ValidationError handling) so future pagination fixes are applied in one place. - SigmaSourceConfig: add a model_validator that warns when data_model_pattern is customized while ingest_data_models=False — the pattern is silently ignored today, which has tricked operators. - DatasetProperties for DM elements: use description=None instead of "" so aspect-replace semantics do not blank out a user-edited description on re-ingestion. - Document the paths[1:] drop-workspace-prefix intent in sigma.py and the vizualizationType API-spelling match in data_classes.py. - Unit tests: cover the file_meta fallback branch in _assemble_data_model (filled-from-file, DM-payload-wins, None-safe) — closes the test-coverage gap called out in review.
mypy inferred the ``base`` dict literals in the new ``TestAssembleDataModelFileMetaFallback`` fixtures as ``dict[str, str]`` because every initial value was a string. ``base.update(overrides)`` then failed to type-check since ``**overrides: object`` produces a ``dict[str, object]`` that cannot widen a ``dict[str, str]``. Annotate both ``base`` dicts explicitly as ``Dict[str, object]`` so the update and later ``model_validate`` calls pass under ``mypy --strict`` baselines without relaxing runtime behavior.
…ination cycle guard
Addresses PR review feedback on the DM ingestion hardening.
M1 (Major): ``gen_containers`` for a Sigma DM received
``description=data_model.description or ""``, so a DM with no
description sent ``""`` to GMS and would blank out any description the
user had edited in the DataHub UI on the next ingest. Pass ``None``
through instead, matching the element-Dataset fix and the Qlik
connector pattern.
M2 (Major): ``_get_data_model_lineage_entries`` read only the first
page of ``/dataModels/{id}/lineage``. For a DM whose lineage spans
multiple pages, elements on page 2+ silently lost their upstreams
(``data_model_element_upstreams_unresolved`` did not bump because the
element's ``source_ids`` were simply absent). Split the HTTP loop out
of ``_paginated_entries`` into ``_paginated_raw_entries`` and have
``/lineage`` consume it. Expected "no lineage" statuses (400/403/404/
500) are still swallowed silently via a new ``silent_statuses``
parameter applied only to the first page.
M3 (Major): ``_paginated_raw_entries`` now tracks already-seen
``nextPage`` / ``nextPageToken`` cursors in a ``Set[str]``. A broken
Sigma proxy that echoes the same cursor on every response used to loop
forever and accumulate duplicates; it now breaks with a report
warning.
m1 (polish): DM element Datasets now carry
``externalUrl=<data_model.url>?:nodeId=<elementId>`` so users can
click from a DM-element Dataset back into Sigma. Mirrors the
deep-link shape used for workbook elements in ``get_page_elements``.
m3 (polish): ``SigmaDataModelElement._discard_api_bare_string_columns``
now emits a ``logger.debug`` when it drops bare-string columns, so an
operator investigating an empty-schema DM element can see the columns
were intentionally deferred to ``/columns``.
m4 (polish): ``TestAssembleDataModelFileMetaFallback`` replaces the
``patch.stopall()`` try/finally pattern with a nested
``with patch.object(...)`` context-manager helper, guaranteeing
cleanup even if an assertion raises mid-test.
Tests: adds ``TestPaginatedRawEntries`` covering multi-page via
``nextPage`` and ``nextPageToken``, the cycle-protection break, the
silent-status swallow, and an end-to-end multi-page ``/lineage``
regression for M2.
…h shape, report hygiene Review feedback on the DM ingestion path: - C1: `_paginated_entries` accepts an optional natural-key dedup and the DM element/column/data-model callers opt in on elementId/columnId/ dataModelId. Raw DM lineage entries dedupe on `(type, nodeId)` inside `_get_data_model_lineage_entries`. Cycle-echo or server-side page overlap no longer double-emits MCPs or inflates counters. - C2: `get_data_models` falls back to `data_model.workspaceId` when `/files` lacks a workspace row, matching how the payload is already honored for emission and customProperties. - M2: Drop the trailing element-name entry from DM element `BrowsePathsV2` so breadcrumbs terminate at the DM Container, matching every other Sigma browsepath builder. - M3: Remove 500 from `_get_data_model_lineage_entries` silent_statuses so a degraded Sigma region surfaces loud warnings instead of zero lineage with zero telemetry. - M5: DataModelKey docstring clarifies that multi-env operators should use `platform_instance` to disambiguate DM Container URNs. - M4: `sigma_pre.md` notes that external-upstream resolution requires the referenced Sigma Dataset to be ingested in the same run, plus the multi-env Container caveat. - m2 (polish): Move the `data_model_pattern`-ignored check from the config-level `model_validator` into `SigmaSource.__init__` and emit it via `self.reporter.warning(...)` so CI / dashboards gating on report warnings see it. - m7 (polish): Cap per-endpoint malformed-entry warnings at 10 while tallying the full drop count on `pagination_malformed_entries_dropped` to prevent warning-storm on a vendor-wide regression. - New counters: `pagination_duplicate_entries_dropped`, `pagination_malformed_entries_dropped`. - New tests: dedup behavior (typed + lineage), 500 surfaces warning, workspace-payload fallback, malformed-entry rate limit.
… test ``_log_http_error`` reads ``e.response.status_code`` off the exception, which real ``requests`` populates on ``raise_for_status`` failures. The unit test was constructing ``HTTPError`` directly with only a message, so ``e.response`` was ``None`` and the test tripped an ``AttributeError`` inside the logger before the warning was recorded. Attach the mocked 500 response onto the HTTPError so the code path under test (M3: surface a warning on /lineage 5xx) is exercised as in production.
…act_lineage gate Review follow-ups on PR #17141 PR 1: C1: `_get_data_model_lineage_entries` dedup key was `(type, nodeId)`, but the real DM `/lineage` payload does not carry `nodeId` -- `element` rows use `elementId` and `dataset`/`table` rows use `inodeId`. Under the old key every real row collapsed to `(<type>, "")` so after the first entry per type everything was silently dropped. Switch to a shape-aware key and preserve rows missing the expected identifier rather than collapsing them. M7: Rewrote the dedup unit tests against the real DM /lineage shape (elementId / inodeId) and added a regression for entries missing the natural key. Tighten test_sigma_ingest_data_models_external_dataset_not_ingested to assert exact counter values (1 intra, 1 resolved external, 1 unresolved external, 0 unknown shape) so C1 cannot regress silently. M1: Split `data_model_element_upstreams_unresolved` into `_unresolved_external` (target dataset exists but was filtered out of the run) vs `_unknown_shape` (source_id shapes this release does not parse, e.g. cross-DM refs). Aggregate counter kept for dashboards. M2: Docstring on `_get_data_model_lineage_entries` now reflects the shape-aware key choice and the rationale for preserving entries with missing identifiers. M5: Duplicate fieldPath collapse is now deterministic -- prefer the row with a formula set (user-authored calc field), tie-break on the lexicographically smallest columnId. Dropped columnIds are emitted to debug logs for reconciliation. M6: `/dataModels/{id}/lineage` is now gated on `extract_lineage` so `ingest_data_models=True` + `extract_lineage=False` emits DM Containers, element Datasets, and SchemaMetadata without hitting any lineage endpoint. M4: Softened hard-coded `StringTypeClass` / `nativeDataType="String"` on DM element columns to `NullTypeClass` + `"unknown"` via a named constant, so downstream type-aware features can distinguish "no type returned by Sigma" from "actually a string." Doc updates: sigma_pre.md covers the new counter split, the extract_lineage gate, and the softened column type; updating-datahub.md bullet reflects the same.
…ource_id Mirror the success-path dedup in `_gen_data_model_element_upstream_lineage` so duplicate `source_ids` in a single element's vendor payload (diamond reference to the same un-ingested `inode-X`) no longer over-count the `data_model_element_upstreams_unresolved` and `data_model_element_upstreams_unresolved_external` / `_unknown_shape` buckets. Emitted aspects are unaffected (they were already deduped via the `seen` set); only report-counter correctness changes. Also add two integration tests: - `test_sigma_ingest_data_models_extract_lineage_false` pins the `ingest_data_models=True + extract_lineage=False` opt-out (no `/lineage` HTTP call, Container + element Datasets + SchemaMetadata still emitted, no `upstreamLineage` aspects). - `test_sigma_ingest_data_models_unresolved_counters_dedup_duplicate_source_id` regresses the counter dedup fix above. Scrub a handful of code/test comments that referenced internal review- round labels (`Review M*`, `Review C1`).
C1: Revert unintentional chart-inputs lexicographic sort in _gen_elements_workunit. The sorted(...) slipped into this DM-focused PR would churn every Sigma chart's ChartInfo aspect on first re-ingest after upgrade for any tenant whose inputs were not already alphabetical. C2: Unify data_model.workspaceId handling between filtering and rendering in _assemble_data_model. When /files and /dataModels disagree, DMs were filtered under the /files workspace but rendered / counted under the /dataModels payload workspace -- a split-brain the new resolved_workspace_id parameter closes. M1: Emit OwnershipClass on DM element Datasets when ingest_owner is enabled and data_model.createdBy resolves, so 'datasets owned by X' DataHub filters no longer miss DM elements despite the author appearing on the enclosing Container. M2: Wrap per-DM assembly in a try/except so one malformed DM does not abort the whole feed. Mirrors get_sigma_workbooks. M3: Drop ignoreCase from the default-pattern detector -- a user who pins ignoreCase: False on an otherwise-default pattern is semantically still 'match everything' and must not trigger the 'data_model_pattern ignored' warning. M5: Merge sourceIds on shape-aware dedup collisions in _get_data_model_lineage_entries rather than silently dropping the second occurrence. Preserves all upstreams if Sigma ever splits one element's lineage across multiple rows. M6: Populate sigma_dataset_urn_by_url_id eagerly before DM iteration in get_workunits_internal so DM external-upstream resolution does not depend on pipeline generator drain order. Tests: unit coverage for C1 (insertion-order pin), ignoreCase-False-default non-warning, lineage sourceIds union on collision, eager dataset-URN-map population, DM element owner emission (3 cases), and per-DM isolation on assembly failure. Integration coverage for C2 (mismatched /files vs /dataModels workspaceId).
|
Linear: ING-2416 |
Two DM-ingest integration tests filtered MCEs by substring-matching the DM urlId (CDJLIyOhUoKBSEVI8Wr4n) against each entityUrn. But DM element URNs embed the immutable dataModelId UUID (147a4d09-a686-4eea-b183-9b82aa0f7beb) per the URN-keying invariant documented in updating-datahub.md -- urlId never appears in a URN. That made the negative assertions pass tautologically (empty filter list) and the one positive assertion in test_sigma_ingest_data_models_extract_lineage_false fail hard (asserted a non-empty list that was always empty): assert element_schema_aspects, 'DM element SchemaMetadata should ...' Switch the substring filter to the dataModelId UUID so: - the positive assertion actually sees the three DM element SchemaMetadata aspects, and - the paired negative assertion on UpstreamLineage becomes a real regression guard instead of tautologically empty. Also tighten the same filter in test_sigma_ingest_data_models_lineage_http_error for consistency.
Two integration tests duplicate coverage already pinned at the unit layer, where the same invariants run faster and isolate the assertion to the exact code path under test. Dropped: - test_sigma_ingest_data_models_next_page_token_pagination Pinned: get_data_models walks both pages when first page advertises a nextPageToken. Already covered by TestPaginatedRawEntries::test_collects_entries_across_pages_via_nextPageToken (and the cycle-guard sibling), which test the same paginator code path on the same _paginated_raw_entries helper that get_data_models uses. - test_sigma_ingest_data_models_schema_duplicate_fieldpath_dropped Pinned: a DM element with two columns sharing the same fieldPath emits one SchemaMetadata field and bumps the dedup counter. Already covered by TestSchemaMetadataEmission::test_duplicate_fieldpath_tiebreak_* which assert len(fields) == 1 and the counter increment, plus the exact tie-break rule (formula wins, then smallest columnId). Tests not dropped despite a partial overlap: - test_sigma_ingest_data_models_unresolved_counters_dedup_duplicate_source_id exercises a *different* code path (the unresolved_seen set inside the per-element source_id walk in sigma.py), not the row-level dedup that TestLineageDedupMerge tests at the API layer. Kept. - test_sigma_ingest_data_models_default_off pins the *default value* of ingest_data_models (omits the flag from config), distinct from test_sigma_ingest_data_models_disabled which sets it explicitly to False. If the default ever flips, only the former fires. Kept. Net: -184 LOC in tests/integration/sigma/test_sigma.py with zero coverage loss; the corresponding 9 unit tests pass locally.
|
🔴 Meticulous spotted visual differences in 9 of 1455 screens tested: view and approve differences detected. Meticulous evaluated ~10 hours of user flows against your PR. Last updated for commit |
Bundle ReportBundle size has no change ✅ |
treff7es
left a comment
There was a problem hiding this comment.
Overall it looks good. It would be nice to have the new model entities in golden tests as well
| """ | ||
|
|
||
| def _dm(self, **overrides: object) -> SigmaDataModel: | ||
| import datetime as _dt |
There was a problem hiding this comment.
Imports should be at the top
| # Same golden as the baseline test — the ingest_data_models=False path | ||
| # should produce identical output to the unconfigured default when no DM | ||
| # mocks are registered. | ||
| import json |
There was a problem hiding this comment.
Imports should be at the top
| pipeline.run() | ||
| pipeline.raise_from_status() | ||
|
|
||
| import json |
There was a problem hiding this comment.
Movet his to the top, please
| f"{[r.url for r in lineage_hits]}" | ||
| ) | ||
|
|
||
| import json |
| report = _sigma_report(pipeline) | ||
| assert report.data_models_without_workspace == 1 | ||
|
|
||
| import json |
| pipeline.run() | ||
| pipeline.raise_from_status() | ||
|
|
||
| import json |
| pipeline.run() | ||
| pipeline.raise_from_status() | ||
|
|
||
| import json |
| Callable, | ||
| Deque, | ||
| Dict, | ||
| Hashable, |
There was a problem hiding this comment.
Since Python 3.9, typing.Hashable is a deprecated alias for collections.abc.Hashable (PEP 585). Ruff rule UP035 flags this.
…s.abc.Hashable typing.Hashable is a deprecated alias since Python 3.9 (PEP 585). Use collections.abc.Hashable directly as flagged by ruff UP035.
Address reviewer feedback: ``import json`` and ``import datetime as _dt`` were scattered inside test functions; move them to the module-level import block where they belong.
Summary
Opt-in Sigma Data Model ingestion. DM →
Container; each element →DatasetwithSchemaMetadataandUpstreamLineage(intra-DM + external Sigma Dataset refs).Element URNs key off the immutable DM UUID:
urn:li:dataset:(sigma,<dataModelId>.<elementId>,env). Additive; no existing URNs change.Emitted
Container(subtypeSigma Data Model)Dataset(subtypeSigma Data Model Element) withSchemaMetadataUpstreamLineage(intra-DM + external Sigma Datasets)Config
ingest_data_models(defaultFalse)data_model_pattern