feat(search): add local embedding provider for on-premise semantic search (Ollama) #17201

Open

shirshanka wants to merge 15 commits into master from worktree-feat+local-embedding-provider

Conversation

@shirshanka
Contributor

Summary

Adds a local embedding provider so DataHub can run fully on-premise semantic search without cloud API keys. The provider calls any locally-running OpenAI-compatible embeddings server — primarily Ollama — using java.net.http.HttpClient and Jackson (zero new dependencies).

  • New Java provider: LocalEmbeddingProvider implements EmbeddingProvider via HTTP to {endpoint}/v1/embeddings
  • Docker Compose: new docker-compose.ollama.yml with ollama service + ollama-model-init one-shot container that pulls and warms up the model on first start
  • Gradle task: quickstartDebugAi activates both debug and debug-ai profiles so Ollama starts alongside GMS
  • Python ingestion: chunking_source.py / chunking_config.py support provider=local via litellm's OpenAI-compatible routing
  • Dev tooling: datahub-dev.sh start --ai spins up the full managed stack and blocks until Ollama's model is warm; --embeddings-endpoint allows BYO server; --no-ai clears the env vars

Default model: nomic-embed-text (768 dimensions) — a strong open-source quality/speed balance among the embedding models available on Ollama.

How it works

GMS → LocalEmbeddingProvider → HTTP POST /v1/embeddings → Ollama (nomic-embed-text)
                                                         → LM Studio / llama.cpp / any compatible server
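
For orientation, here is a minimal sketch of the OpenAI-compatible exchange behind that arrow. It is illustrative only: standard-library Python against a locally running Ollama, not the Java provider code.

# Illustrative probe of the /v1/embeddings contract the provider relies on.
# Assumes Ollama is running locally and nomic-embed-text has been pulled.
import json
import urllib.request

ENDPOINT = "http://localhost:11434/v1/embeddings"

payload = json.dumps(
    {"model": "nomic-embed-text", "input": ["how to request data access permissions"]}
).encode("utf-8")

request = urllib.request.Request(
    ENDPOINT, data=payload, headers={"Content-Type": "application/json"}
)

with urllib.request.urlopen(request, timeout=120) as response:
    body = json.load(response)

# OpenAI-compatible servers answer with {"data": [{"embedding": [...]}, ...]}.
vector = body["data"][0]["embedding"]
print(len(vector))  # 768 for nomic-embed-text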

Quick start (managed)

scripts/dev/datahub-dev.sh start --ai
# Ollama starts, model is pulled and warmed up, GMS starts with semantic search enabled.
# First search query has no cold-start delay.

BYO existing server

scripts/dev/datahub-dev.sh start --ai \
  --embeddings-endpoint http://localhost:11434/v1/embeddings \
  --embeddings-model mxbai-embed-large

Configuration (env vars)

| Env var | Default | Description |
| --- | --- | --- |
| `EMBEDDING_PROVIDER_TYPE` | `openai` | Set to `local` |
| `LOCAL_EMBEDDING_ENDPOINT` | `http://localhost:11434/v1/embeddings` | Embeddings server URL |
| `LOCAL_EMBEDDING_MODEL` | `nomic-embed-text` | Model name (must be pulled on the server) |
| `LOCAL_EMBEDDING_VECTOR_DIMENSION` | `768` | Override when switching to a different-dimension model |

Actionable error messages

  • Server not running → "Cannot connect to … Is it running? Start Ollama with: ollama serve"
  • Model not pulled → "Model not pulled? Try: ollama pull nomic-embed-text"

Files changed

| File | Change |
| --- | --- |
| `metadata-io/.../LocalEmbeddingProvider.java` | New provider (zero new deps) |
| `metadata-io/.../LocalEmbeddingProviderTest.java` | 15 unit tests |
| `metadata-service/.../EmbeddingProviderConfiguration.java` | Added `LocalConfig` nested class |
| `metadata-service/.../EmbeddingProviderFactory.java` | Added `case "local"` |
| `metadata-service/.../application.yaml` | Added `local:` block + `nomic_embed_text` model entry |
| `docker/profiles/docker-compose.ollama.yml` | New Ollama service + warmup init container |
| `docker/profiles/docker-compose.yml` | Include ollama compose file |
| `docker/build.gradle` | New `quickstartDebugAi` task (debug + debug-ai profiles) |
| `metadata-ingestion/.../chunking_config.py` | `local` provider support, `endpoint` field, constant |
| `metadata-ingestion/.../chunking_source.py` | Local provider dispatch, `model_key` fix, `validate_provider` fix |
| `scripts/dev/datahub_dev.py` | `--ai`, `--no-ai`, `--embeddings-endpoint`, `--embeddings-model`, model-ready wait |
| `smoke-test/tests/semantic/test_local_embedding_provider.py` | 4 smoke tests (gate: `LOCAL_EMBEDDING_PROVIDER_TESTS=true`) |

Test plan

  • ./gradlew :metadata-io:test --tests "*LocalEmbeddingProvider*" — 15 unit tests pass
  • scripts/dev/datahub-dev.sh start --ai — Ollama starts, model pulled and warmed, GMS healthy
  • Smoke tests (LOCAL_EMBEDDING_PROVIDER_TESTS=true):
    • Connectivity probe ✓
    • nomic_embed_text key with 768-dim vectors confirmed ✓
    • Semantic ranking correct: "Data Access Request Process" ranks above unrelated docs for "how to request data access permissions" ✓
  • ./gradlew :metadata-ingestion:lintFix — clean

🤖 Generated with Claude Code

shirshanka and others added 2 commits April 26, 2026 13:10

Adds a `local` embedding provider that calls any OpenAI-compatible embedding
server (primarily Ollama) without cloud API keys, enabling fully on-premise
semantic search deployments.

## Changes

### Java backend
- `LocalEmbeddingProvider`: HTTP client using `java.net.http.HttpClient` to
  call Ollama's `/v1/embeddings` endpoint; ConnectException gives actionable
  "ollama serve" hint; 404 gives "ollama pull <model>" hint
- `EmbeddingProviderConfiguration`: added `LocalConfig` with `endpoint` and
  `model` fields defaulting to Ollama + nomic-embed-text
- `EmbeddingProviderFactory`: added `case "local"` dispatch
- `application.yaml`: added `local:` config block and `nomic_embed_text` (768d)
  model entry under `embeddingProvider.models`

### Docker
- `docker-compose.ollama.yml`: new `ollama` service (port 11434, profile
  `debug-ai`/`quickstart-ai`) + `ollama-model-init` one-shot model puller
- `docker-compose.yml`: includes the new ollama compose file
- `build.gradle`: new `quickstartDebugAi` Gradle task activating both `debug`
  and `debug-ai` profiles simultaneously

### Python ingestion
- `chunking_config.py`: added `"local"` to provider enum, `endpoint` field,
  `from_server()` support reading `LOCAL_EMBEDDING_ENDPOINT` env var
- `chunking_source.py`: local provider branch using litellm `openai/<model>`
  routing with `api_base` derived from endpoint and `api_key="local"`
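
  A minimal sketch of that routing, for illustration only (not the verbatim chunking_source.py code; it assumes litellm's standard embedding() call):

    import os
    import litellm

    def embed_locally(texts, model="nomic-embed-text", endpoint=None):
        """Dispatch an embedding call to a local OpenAI-compatible server via litellm."""
        endpoint = endpoint or os.environ.get(
            "LOCAL_EMBEDDING_ENDPOINT", "http://localhost:11434/v1/embeddings"
        )
        # "openai/<model>" selects litellm's OpenAI-compatible path; api_base points it
        # at the local server (endpoint minus the trailing /embeddings) instead of api.openai.com.
        api_base = endpoint[: -len("/embeddings")] if endpoint.endswith("/embeddings") else endpoint
        return litellm.embedding(
            model=f"openai/{model}",
            input=texts,
            api_base=api_base,
            api_key="local",  # dummy value; Ollama ignores it but the client requires one
        )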

### Dev tooling
- `datahub_dev.sh` / `datahub_dev.py`: `--ai` flag sets 5 required env vars
  (provider type, semantic search flags, Ollama endpoint/model) and activates
  the `quickstartDebugAi` Gradle task

### Tests
- `LocalEmbeddingProviderTest.java`: 12 unit tests covering success, retries,
  ConnectException hint, 404 hint, invalid JSON
- `test_local_embedding_provider.py`: 4 smoke tests verifying Ollama
  reachability, embedding key/dimension correctness, and semantic ranking

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Split LocalEmbeddingProvider timeout into CONNECT_TIMEOUT=10s and
  REQUEST_TIMEOUT=120s; cold GGUF model loading can take 60s+ so a longer
  read timeout is needed while the connect timeout stays short
- Fix ConnectException detection: HttpClient sometimes wraps it inside
  IOException; add getCause() check so the helpful "ollama serve" hint
  fires in both cases; extract to newConnectError() helper
- Add testWrappedConnectException and testIoExceptionRetryExhausted tests
  (15 total, was 12)
- Fix _validate_provider_config: add 'local' branch so test_connection
  correctly reports capability instead of always returning False
- Fix SemanticContent model_key: use model_embedding_key from server config
  when available (authoritative), fall back to derivation only when not set
- Extract _LOCAL_EMBEDDING_DEFAULT_ENDPOINT constant in chunking_config.py
  to keep Java and Python defaults in sync
- Ollama-model-init: add warmup embedding request after model pull so GGUF
  is loaded into memory before the container exits; add restart: "no"
- application.yaml: make nomic_embed_text vectorDimension configurable via
  LOCAL_EMBEDDING_VECTOR_DIMENSION env var
- datahub_dev.py: add --no-ai flag to clear AI env vars; add
  --embeddings-endpoint (BYO server, skips Ollama container) and
  --embeddings-model flags; add _wait_for_ollama_model_ready() probe so
  'start --ai' blocks until model is loaded and the first search query
  is warm; future AI capabilities (chat etc.) can add --chat-endpoint

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Contributor

Linear: PFP-3533

@github-actions github-actions bot added the ingestion, product, devops, and smoke_test labels on Apr 26, 2026
@codecov

codecov Bot commented Apr 26, 2026

Codecov Report

❌ Patch coverage is 88.09524% with 15 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...tory/search/semantic/EmbeddingProviderFactory.java | 0.00% | 6 Missing ⚠️ |
| ...adata/search/embedding/LocalEmbeddingProvider.java | 94.25% | 2 Missing and 3 partials ⚠️ |
| ...b/ingestion/source/unstructured/chunking_config.py | 72.72% | 3 Missing ⚠️ |
| .../config/search/EmbeddingProviderConfiguration.java | 0.00% | 1 Missing ⚠️ |


shirshanka and others added 2 commits April 26, 2026 14:02
New workflow docker-quickstart-ai.yml:
- Triggers on changes to LocalEmbeddingProvider, EmbeddingProviderFactory,
  application.yaml, docker-compose.ollama.yml, chunking_*.py, and the
  semantic smoke tests — plus nightly schedule and workflow_dispatch
- Builds DataHub quickstart images from source (:docker:buildImagesQuickstart)
- Starts docker compose with both quickstart-consumers and quickstart-ai
  profiles simultaneously (GMS + core services + Ollama)
- Injects AI env vars into GMS via DATAHUB_LOCAL_COMMON_ENV temp file
  (EMBEDDING_PROVIDER_TYPE=local, semantic search flags, Ollama endpoint)
- Waits for Ollama model to be fully loaded via host-side probe loop
  (40 retries × 10s = up to 400s) before running tests
- Runs tests/semantic/test_local_embedding_provider.py with
  LOCAL_EMBEDDING_PROVIDER_TESTS=true
- Supports --embedding-model dispatch input to test non-default models
- Uploads JUnit XML results (30d retention) and Docker logs on failure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ollama-model-init exits with code 0 after pulling and warming the model.
docker compose --wait treats any container exit as failure, which breaks
the startup step even on success. Replace with explicit health polling:
- GMS readiness: curl localhost:8080/health (90x10s = 900s max)
- Ollama model readiness: embed probe (60x10s = 600s max)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Pin pytest-timeout to exact version to satisfy dep-pinning CI check
- Fix mypy error: raw_endpoint or-chain now ends on a str literal so
  mypy correctly infers str (not str | None) for the slice/endswith ops

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cover the local provider branch in chunking_source.py: model prefixing,
api_base derivation from config/env/fallback, model_embedding_key
normalisation, and provider-config validation. Brings patch coverage above
the 75% Codecov threshold.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
shirshanka and others added 2 commits April 26, 2026 19:42

PropertiesCollectorConfigurationTest requires all new config properties to
be explicitly listed as sensitive or non-sensitive. The LocalConfig endpoint
and model fields, plus the nomic_embed_text model index settings, are
non-sensitive operational parameters (no secrets).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mypy 1.17.1 (CI version) doesn't narrow the aspect union through hasattr
checks. Replace with getattr chains which are unconditionally safe.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

# Use litellm.embedding() which works with Bedrock, Cohere, OpenAI, and
# local OpenAI-compatible servers.
response = litellm.embedding(
Collaborator


Unrelated to this PR: lately we've seen so many security issues related to litellm; perhaps we should consider alternatives.

HttpRequest request =
HttpRequest.newBuilder()
.uri(URI.create(endpoint))
.timeout(REQUEST_TIMEOUT)
Collaborator


Is the request timeout here supposed to save us from the case when the server is malfunctioning? Should we make it larger? 120 seconds seems too aggressive.

@alexsku (Collaborator) left a comment


Nice work — the feature is well-scoped, defaults are unchanged for existing deployments, and CI (including the new dedicated AI smoke workflow) is green. Approving with two real bugs and a few smaller notes the author should look at before merge.

High-priority issues

1. test_embedding_capability doesn't pass api_base for the local provider — connection test routes to api.openai.com

File: metadata-ingestion/src/datahub/ingestion/source/unstructured/chunking_source.py:1212

_validate_provider_config returns embedding_model = "openai/<model>" for provider="local" (line 1119-1121). But test_embedding_capability then calls litellm.embedding(model=embedding_model, …) without an api_base:

response = litellm.embedding(
    model=embedding_model,
    input=[test_text],
    api_key=embedding_config.api_key.get_secret_value() if embedding_config.api_key else None,
    aws_region_name=embedding_config.aws_region,
)

litellm interprets openai/<model> as a real OpenAI call and routes to https://api.openai.com/v1/embeddings. With no real key, the test reports the local provider as not capable even when the local server is healthy.

This is reachable in production: notion_source.py:1048 and confluence_source.py:1121 both call DocumentChunkingSource.test_embedding_capability(resolved_config) during test_connection, and resolved_config can resolve to provider="local" via EmbeddingConfig.from_server when the server is configured for local embeddings.

Fix: Mirror _generate_embeddings — when embedding_config.provider == "local", derive api_base from embedding_config.endpoint (or LOCAL_EMBEDDING_ENDPOINT), strip /embeddings, and pass api_key="local". A small _resolve_local_litellm_kwargs(embedding_config) helper would let both call sites share the logic. Add a unit test mirroring the existing test_local_provider_api_base_* tests.
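
For concreteness, a sketch of that helper; the name and the config fields it reads follow the review text and are assumptions, not existing code:

    import os

    def _resolve_local_litellm_kwargs(embedding_config) -> dict:
        """Extra litellm kwargs when provider == "local"; empty otherwise (sketch)."""
        if embedding_config.provider != "local":
            return {}
        endpoint = (
            getattr(embedding_config, "endpoint", None)
            or os.environ.get("LOCAL_EMBEDDING_ENDPOINT")
            or "http://localhost:11434/v1/embeddings"
        )
        # litellm expects the server base URL, so strip the trailing /embeddings.
        api_base = endpoint[: -len("/embeddings")] if endpoint.endswith("/embeddings") else endpoint
        return {"api_base": api_base, "api_key": "local"}

    # test_embedding_capability (and _generate_embeddings) would then call:
    #   litellm.embedding(
    #       model=embedding_model,
    #       input=[test_text],
    #       **_resolve_local_litellm_kwargs(embedding_config),
    #   )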

2. Smoke test queries a non-existent GraphQL field and silently skips

File: smoke-test/tests/semantic/test_local_embedding_provider.py:179-222

The test queries appConfig.semanticSearchConfig.embeddingProviderType, but embeddingProviderType does not exist in app.semantic.graphql. SemanticSearchConfig exposes embeddingConfig.provider, not embeddingProviderType. Grep confirms this symbol is referenced only in this test file.

GraphQL validation rejects the query → result["errors"] is populated → the if "errors" in result: pytest.skip(...) branch fires → the assert provider_type == "local" line is never reached. So the test silently always skips, regardless of how the server is actually configured.

Fix: Either add the field to the GraphQL schema (and AppConfigResolver), or change the query to use the existing field:

semanticSearchConfig { embeddingConfig { provider } }

…and assert on provider == "local".

Other notes

  • [medium / correctness] chunking_source.py:200 — At runtime, the local branch unconditionally prepends openai/ to the model name, but _validate_provider_config (line 1120) skips the prefix if it's already present. The OpenAI branch (line 185) handles this consistently. Result: a user who sets model: "openai/nomic-embed-text" gets a different model string between validation (openai/nomic-embed-text) and runtime (openai/openai/nomic-embed-text). The test test_local_provider_already_prefixed_model locks in the buggy behavior, but its docstring says the opposite. Apply the same if not model_name.startswith("openai/") guard at runtime (see the sketch after this list).

  • [medium / operability] In docker/profiles/docker-compose.ollama.yml:53, ollama-model-init shells out to curl for warmup, but the ollama/ollama base image isn't documented to ship curl. CI works around this by polling from the runner, but local quickstartDebugAi users may see the init container marked as failed even when everything works. Consider using ollama run for warmup, or installing curl explicitly.

  • [low / dx] In LocalEmbeddingProvider.java:157, Authorization: Bearer local is hardcoded. Fine for Ollama, but a user putting an authenticated proxy in front has no way to override it. Consider exposing an optional token field on LocalConfig (default empty → current behavior).

  • [low / tests] In LocalEmbeddingProviderTest.java, testNullTextThrows, testDefaultConstructor, and testTwoArgConstructor test the @Nonnull framework contract / Lombok-style construction. Per AGENTS.md testing conventions these are anti-pattern tests and can be removed.

  • What's missing: a docs/how/ page for the on-prem semantic search flow with Ollama — the PR description content is great and belongs in the docs site.
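
A sketch of the prefix guard suggested in the chunking_source.py:200 note above (illustrative; this helper name is not existing code):

    def _as_litellm_openai_model(model_name: str) -> str:
        """Prefix with "openai/" only when missing, so validation and runtime agree."""
        return model_name if model_name.startswith("openai/") else f"openai/{model_name}"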

Questions

  1. Was embeddingProviderType intended as a GraphQL schema addition? If so, it's missing; if not, the smoke test should query embeddingConfig.provider.
  2. Why does runtime __init__ not strip an existing openai/ prefix for local, while _validate_provider_config does? Intentional or oversight?
  3. Did you verify curl is present in ollama/ollama:latest? CI doesn't depend on it.

Approving — runtime path is solid, default config is unchanged, and the broken test_connection path / silently-skipping smoke test should be quick fixes after merge.

@github-actions
Contributor

Your PR has been assigned to @shirshanka (shirshanka) for review (PFP-3533).

Labels

devops, ingestion, pending-submitter-merge, product, smoke_test
