feat(search): add local embedding provider for on-premise semantic search (Ollama) #17201

Open

shirshanka wants to merge 15 commits into master from worktree-feat+local-embedding-provider

Conversation

@shirshanka
Contributor

Summary

Adds a local embedding provider so DataHub can run fully on-premise semantic search without cloud API keys. The provider calls any locally-running OpenAI-compatible embeddings server — primarily Ollama — using java.net.http.HttpClient and Jackson (zero new dependencies).

  • New Java provider: LocalEmbeddingProvider implements EmbeddingProvider via HTTP to {endpoint}/v1/embeddings
  • Docker Compose: new docker-compose.ollama.yml with ollama service + ollama-model-init one-shot container that pulls and warms up the model on first start
  • Gradle task: quickstartDebugAi activates both debug and debug-ai profiles so Ollama starts alongside GMS
  • Python ingestion: chunking_source.py / chunking_config.py support provider=local via litellm's OpenAI-compatible routing
  • Dev tooling: datahub-dev.sh start --ai spins up the full managed stack and blocks until Ollama's model is warm; --embeddings-endpoint allows BYO server; --no-ai clears the env vars

Default model: nomic-embed-text (768 dimensions) — a strong open-source quality/speed balance among the embedding models available on Ollama.

How it works

GMS → LocalEmbeddingProvider → HTTP POST /v1/embeddings → Ollama (nomic-embed-text)
                                                         → LM Studio / llama.cpp / any compatible server
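
For orientation, here is a minimal sketch of the OpenAI-compatible exchange behind that arrow. It is illustrative only: standard-library Python against a locally running Ollama, not the Java provider code.

# Illustrative probe of the /v1/embeddings contract the provider relies on.
# Assumes Ollama is running locally and nomic-embed-text has been pulled.
import json
import urllib.request

ENDPOINT = "http://localhost:11434/v1/embeddings"

payload = json.dumps(
    {"model": "nomic-embed-text", "input": ["how to request data access permissions"]}
).encode("utf-8")

request = urllib.request.Request(
    ENDPOINT, data=payload, headers={"Content-Type": "application/json"}
)

with urllib.request.urlopen(request, timeout=120) as response:
    body = json.load(response)

# OpenAI-compatible servers answer with {"data": [{"embedding": [...]}, ...]}.
vector = body["data"][0]["embedding"]
print(len(vector))  # 768 for nomic-embed-text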

Quick start (managed)

scripts/dev/datahub-dev.sh start --ai
# Ollama starts, model is pulled and warmed up, GMS starts with semantic search enabled.
# First search query has no cold-start delay.

BYO existing server

scripts/dev/datahub-dev.sh start --ai \
  --embeddings-endpoint http://localhost:11434/v1/embeddings \
  --embeddings-model mxbai-embed-large

Configuration (env vars)

| Env var | Default | Description |
| --- | --- | --- |
| `EMBEDDING_PROVIDER_TYPE` | `openai` | Set to `local` |
| `LOCAL_EMBEDDING_ENDPOINT` | `http://localhost:11434/v1/embeddings` | Embeddings server URL |
| `LOCAL_EMBEDDING_MODEL` | `nomic-embed-text` | Model name (must be pulled on the server) |
| `LOCAL_EMBEDDING_VECTOR_DIMENSION` | `768` | Override when switching to a different-dimension model |

Actionable error messages

  • Server not running → "Cannot connect to … Is it running? Start Ollama with: ollama serve"
  • Model not pulled → "Model not pulled? Try: ollama pull nomic-embed-text"

Files changed

| File | Change |
| --- | --- |
| `metadata-io/.../LocalEmbeddingProvider.java` | New provider (zero new deps) |
| `metadata-io/.../LocalEmbeddingProviderTest.java` | 15 unit tests |
| `metadata-service/.../EmbeddingProviderConfiguration.java` | Added `LocalConfig` nested class |
| `metadata-service/.../EmbeddingProviderFactory.java` | Added `case "local"` |
| `metadata-service/.../application.yaml` | Added `local:` block + `nomic_embed_text` model entry |
| `docker/profiles/docker-compose.ollama.yml` | New Ollama service + warmup init container |
| `docker/profiles/docker-compose.yml` | Include ollama compose file |
| `docker/build.gradle` | New `quickstartDebugAi` task (debug + debug-ai profiles) |
| `metadata-ingestion/.../chunking_config.py` | `local` provider support, `endpoint` field, constant |
| `metadata-ingestion/.../chunking_source.py` | Local provider dispatch, `model_key` fix, `validate_provider` fix |
| `scripts/dev/datahub_dev.py` | `--ai`, `--no-ai`, `--embeddings-endpoint`, `--embeddings-model`, model-ready wait |
| `smoke-test/tests/semantic/test_local_embedding_provider.py` | 4 smoke tests (gate: `LOCAL_EMBEDDING_PROVIDER_TESTS=true`) |

Test plan

  • ./gradlew :metadata-io:test --tests "*LocalEmbeddingProvider*" — 15 unit tests pass
  • scripts/dev/datahub-dev.sh start --ai — Ollama starts, model pulled and warmed, GMS healthy
  • Smoke tests (LOCAL_EMBEDDING_PROVIDER_TESTS=true):
    • Connectivity probe ✓
    • nomic_embed_text key with 768-dim vectors confirmed ✓
    • Semantic ranking correct: "Data Access Request Process" ranks above unrelated docs for "how to request data access permissions" ✓
  • ./gradlew :metadata-ingestion:lintFix — clean

🤖 Generated with Claude Code

shirshanka and others added 2 commits April 26, 2026 13:10

Adds a `local` embedding provider that calls any OpenAI-compatible embedding
server (primarily Ollama) without cloud API keys, enabling fully on-premise
semantic search deployments.

## Changes

### Java backend
- `LocalEmbeddingProvider`: HTTP client using `java.net.http.HttpClient` to
  call Ollama's `/v1/embeddings` endpoint; ConnectException gives actionable
  "ollama serve" hint; 404 gives "ollama pull <model>" hint
- `EmbeddingProviderConfiguration`: added `LocalConfig` with `endpoint` and
  `model` fields defaulting to Ollama + nomic-embed-text
- `EmbeddingProviderFactory`: added `case "local"` dispatch
- `application.yaml`: added `local:` config block and `nomic_embed_text` (768d)
  model entry under `embeddingProvider.models`

### Docker
- `docker-compose.ollama.yml`: new `ollama` service (port 11434, profile
  `debug-ai`/`quickstart-ai`) + `ollama-model-init` one-shot model puller
- `docker-compose.yml`: includes the new ollama compose file
- `build.gradle`: new `quickstartDebugAi` Gradle task activating both `debug`
  and `debug-ai` profiles simultaneously

### Python ingestion
- `chunking_config.py`: added `"local"` to provider enum, `endpoint` field,
  `from_server()` support reading `LOCAL_EMBEDDING_ENDPOINT` env var
- `chunking_source.py`: local provider branch using litellm `openai/<model>`
  routing with `api_base` derived from endpoint and `api_key="local"`
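
  A minimal sketch of that routing, for illustration only (not the verbatim chunking_source.py code; it assumes litellm's standard embedding() call):

    import os
    import litellm

    def embed_locally(texts, model="nomic-embed-text", endpoint=None):
        """Dispatch an embedding call to a local OpenAI-compatible server via litellm."""
        endpoint = endpoint or os.environ.get(
            "LOCAL_EMBEDDING_ENDPOINT", "http://localhost:11434/v1/embeddings"
        )
        # "openai/<model>" selects litellm's OpenAI-compatible path; api_base points it
        # at the local server (endpoint minus the trailing /embeddings) instead of api.openai.com.
        api_base = endpoint[: -len("/embeddings")] if endpoint.endswith("/embeddings") else endpoint
        return litellm.embedding(
            model=f"openai/{model}",
            input=texts,
            api_base=api_base,
            api_key="local",  # dummy value; Ollama ignores it but the client requires one
        )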

### Dev tooling
- `datahub_dev.sh` / `datahub_dev.py`: `--ai` flag sets 5 required env vars
  (provider type, semantic search flags, Ollama endpoint/model) and activates
  the `quickstartDebugAi` Gradle task

### Tests
- `LocalEmbeddingProviderTest.java`: 12 unit tests covering success, retries,
  ConnectException hint, 404 hint, invalid JSON
- `test_local_embedding_provider.py`: 4 smoke tests verifying Ollama
  reachability, embedding key/dimension correctness, and semantic ranking

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Split LocalEmbeddingProvider timeout into CONNECT_TIMEOUT=10s and
  REQUEST_TIMEOUT=120s; cold GGUF model loading can take 60s+ so a longer
  read timeout is needed while the connect timeout stays short
- Fix ConnectException detection: HttpClient sometimes wraps it inside
  IOException; add getCause() check so the helpful "ollama serve" hint
  fires in both cases; extract to newConnectError() helper
- Add testWrappedConnectException and testIoExceptionRetryExhausted tests
  (15 total, was 12)
- Fix _validate_provider_config: add 'local' branch so test_connection
  correctly reports capability instead of always returning False
- Fix SemanticContent model_key: use model_embedding_key from server config
  when available (authoritative), fall back to derivation only when not set
- Extract _LOCAL_EMBEDDING_DEFAULT_ENDPOINT constant in chunking_config.py
  to keep Java and Python defaults in sync
- Ollama-model-init: add warmup embedding request after model pull so GGUF
  is loaded into memory before the container exits; add restart: "no"
- application.yaml: make nomic_embed_text vectorDimension configurable via
  LOCAL_EMBEDDING_VECTOR_DIMENSION env var
- datahub_dev.py: add --no-ai flag to clear AI env vars; add
  --embeddings-endpoint (BYO server, skips Ollama container) and
  --embeddings-model flags; add _wait_for_ollama_model_ready() probe so
  'start --ai' blocks until model is loaded and the first search query
  is warm; future AI capabilities (chat etc.) can add --chat-endpoint

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Contributor

Linear: PFP-3533

@github-actions github-actions bot added the ingestion, product, devops, and smoke_test labels on Apr 26, 2026
@codecov

codecov Bot commented Apr 26, 2026

Codecov Report

❌ Patch coverage is 88.09524% with 15 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...tory/search/semantic/EmbeddingProviderFactory.java | 0.00% | 6 Missing ⚠️ |
| ...adata/search/embedding/LocalEmbeddingProvider.java | 94.25% | 2 Missing and 3 partials ⚠️ |
| ...b/ingestion/source/unstructured/chunking_config.py | 72.72% | 3 Missing ⚠️ |
| .../config/search/EmbeddingProviderConfiguration.java | 0.00% | 1 Missing ⚠️ |


shirshanka and others added 2 commits April 26, 2026 14:02
New workflow docker-quickstart-ai.yml:
- Triggers on changes to LocalEmbeddingProvider, EmbeddingProviderFactory,
  application.yaml, docker-compose.ollama.yml, chunking_*.py, and the
  semantic smoke tests — plus nightly schedule and workflow_dispatch
- Builds DataHub quickstart images from source (:docker:buildImagesQuickstart)
- Starts docker compose with both quickstart-consumers and quickstart-ai
  profiles simultaneously (GMS + core services + Ollama)
- Injects AI env vars into GMS via DATAHUB_LOCAL_COMMON_ENV temp file
  (EMBEDDING_PROVIDER_TYPE=local, semantic search flags, Ollama endpoint)
- Waits for Ollama model to be fully loaded via host-side probe loop
  (40 retries × 10s = up to 400s) before running tests
- Runs tests/semantic/test_local_embedding_provider.py with
  LOCAL_EMBEDDING_PROVIDER_TESTS=true
- Supports --embedding-model dispatch input to test non-default models
- Uploads JUnit XML results (30d retention) and Docker logs on failure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ollama-model-init exits with code 0 after pulling and warming the model.
docker compose --wait treats any container exit as failure, which breaks
the startup step even on success. Replace with explicit health polling:
- GMS readiness: curl localhost:8080/health (90x10s = 900s max)
- Ollama model readiness: embed probe (60x10s = 600s max)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Pin pytest-timeout to exact version to satisfy dep-pinning CI check
- Fix mypy error: raw_endpoint or-chain now ends on a str literal so
  mypy correctly infers str (not str | None) for the slice/endswith ops

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cover the local provider branch in chunking_source.py: model prefixing,
api_base derivation from config/env/fallback, model_embedding_key
normalisation, and provider-config validation. Brings patch coverage above
the 75% Codecov threshold.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
shirshanka and others added 2 commits April 26, 2026 19:42

PropertiesCollectorConfigurationTest requires all new config properties to
be explicitly listed as sensitive or non-sensitive. The LocalConfig endpoint
and model fields, plus the nomic_embed_text model index settings, are
non-sensitive operational parameters (no secrets).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mypy 1.17.1 (CI version) doesn't narrow the aspect union through hasattr
checks. Replace with getattr chains which are unconditionally safe.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

# Use litellm.embedding() which works with Bedrock, Cohere, OpenAI, and
# local OpenAI-compatible servers.
response = litellm.embedding(
Collaborator


Unrelated to this PR: lately we've seen so many security issues related to litellm; perhaps we should consider alternatives.

HttpRequest request =
HttpRequest.newBuilder()
.uri(URI.create(endpoint))
.timeout(REQUEST_TIMEOUT)
Collaborator


Is the request timeout here supposed to save us from the case when the server is malfunctioning? Should we make it larger? 120 seconds seems too aggressive.

@alexsku (Collaborator) left a comment


Nice work — the feature is well-scoped, defaults are unchanged for existing deployments, and CI (including the new dedicated AI smoke workflow) is green. Approving with two real bugs and a few smaller notes the author should look at before merge.

High-priority issues

1. test_embedding_capability doesn't pass api_base for the local provider — connection test routes to api.openai.com

File: metadata-ingestion/src/datahub/ingestion/source/unstructured/chunking_source.py:1212

_validate_provider_config returns embedding_model = "openai/<model>" for provider="local" (line 1119-1121). But test_embedding_capability then calls litellm.embedding(model=embedding_model, …) without an api_base:

response = litellm.embedding(
    model=embedding_model,
    input=[test_text],
    api_key=embedding_config.api_key.get_secret_value() if embedding_config.api_key else None,
    aws_region_name=embedding_config.aws_region,
)

litellm interprets openai/<model> as a real OpenAI call and routes to https://api.openai.com/v1/embeddings. With no real key, the test reports the local provider as not capable even when the local server is healthy.

This is reachable in production: notion_source.py:1048 and confluence_source.py:1121 both call DocumentChunkingSource.test_embedding_capability(resolved_config) during test_connection, and resolved_config can resolve to provider="local" via EmbeddingConfig.from_server when the server is configured for local embeddings.

Fix: Mirror _generate_embeddings — when embedding_config.provider == "local", derive api_base from embedding_config.endpoint (or LOCAL_EMBEDDING_ENDPOINT), strip /embeddings, and pass api_key="local". A small _resolve_local_litellm_kwargs(embedding_config) helper would let both call sites share the logic. Add a unit test mirroring the existing test_local_provider_api_base_* tests.
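
For concreteness, a sketch of that helper; the name and the config fields it reads follow the review text and are assumptions, not existing code:

    import os

    def _resolve_local_litellm_kwargs(embedding_config) -> dict:
        """Extra litellm kwargs when provider == "local"; empty otherwise (sketch)."""
        if embedding_config.provider != "local":
            return {}
        endpoint = (
            getattr(embedding_config, "endpoint", None)
            or os.environ.get("LOCAL_EMBEDDING_ENDPOINT")
            or "http://localhost:11434/v1/embeddings"
        )
        # litellm expects the server base URL, so strip the trailing /embeddings.
        api_base = endpoint[: -len("/embeddings")] if endpoint.endswith("/embeddings") else endpoint
        return {"api_base": api_base, "api_key": "local"}

    # test_embedding_capability (and _generate_embeddings) would then call:
    #   litellm.embedding(
    #       model=embedding_model,
    #       input=[test_text],
    #       **_resolve_local_litellm_kwargs(embedding_config),
    #   )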

2. Smoke test queries a non-existent GraphQL field and silently skips

File: smoke-test/tests/semantic/test_local_embedding_provider.py:179-222

The test queries appConfig.semanticSearchConfig.embeddingProviderType, but embeddingProviderType does not exist in app.semantic.graphql. SemanticSearchConfig exposes embeddingConfig.provider, not embeddingProviderType. Grep confirms this symbol is referenced only in this test file.

GraphQL validation rejects the query → result["errors"] is populated → the if "errors" in result: pytest.skip(...) branch fires → the assert provider_type == "local" line is never reached. So the test silently always skips, regardless of how the server is actually configured.

Fix: Either add the field to the GraphQL schema (and AppConfigResolver), or change the query to use the existing field:

semanticSearchConfig { embeddingConfig { provider } }

…and assert on provider == "local".

Other notes

  • [medium / correctness] chunking_source.py:200 — At runtime, the local branch unconditionally prepends openai/ to the model name, but _validate_provider_config (line 1120) skips the prefix if it's already present. The OpenAI branch (line 185) handles this consistently. Result: a user who sets model: "openai/nomic-embed-text" gets a different model string between validation (openai/nomic-embed-text) and runtime (openai/openai/nomic-embed-text). The test test_local_provider_already_prefixed_model locks in the buggy behavior, but its docstring says the opposite. Apply the same if not model_name.startswith("openai/") guard at runtime (see the sketch after this list).

  • [medium / operability] In docker/profiles/docker-compose.ollama.yml:53, ollama-model-init shells out to curl for warmup, but the ollama/ollama base image isn't documented to ship curl. CI works around this by polling from the runner, but local quickstartDebugAi users may see the init container marked as failed even when everything works. Consider using ollama run for warmup, or installing curl explicitly.

  • [low / dx] In LocalEmbeddingProvider.java:157, Authorization: Bearer local is hardcoded. Fine for Ollama, but a user putting an authenticated proxy in front has no way to override it. Consider exposing an optional token field on LocalConfig (default empty → current behavior).

  • [low / tests] In LocalEmbeddingProviderTest.java, testNullTextThrows, testDefaultConstructor, and testTwoArgConstructor test the @Nonnull framework contract / Lombok-style construction. Per AGENTS.md testing conventions these are anti-pattern tests and can be removed.

  • What's missing: a docs/how/ page for the on-prem semantic search flow with Ollama — the PR description content is great and belongs in the docs site.
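
A sketch of the prefix guard suggested in the chunking_source.py:200 note above (illustrative; this helper name is not existing code):

    def _as_litellm_openai_model(model_name: str) -> str:
        """Prefix with "openai/" only when missing, so validation and runtime agree."""
        return model_name if model_name.startswith("openai/") else f"openai/{model_name}"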

Questions

  1. Was embeddingProviderType intended as a GraphQL schema addition? If so, it's missing; if not, the smoke test should query embeddingConfig.provider.
  2. Why does runtime __init__ not strip an existing openai/ prefix for local, while _validate_provider_config does? Intentional or oversight?
  3. Did you verify curl is present in ollama/ollama:latest? CI doesn't depend on it.

Approving — runtime path is solid, default config is unchanged, and the broken test_connection path / silently-skipping smoke test should be quick fixes after merge.

@github-actions
Contributor

Your PR has been assigned to @shirshanka (shirshanka) for review (PFP-3533).

Labels

devops, ingestion, pending-submitter-merge, product, smoke_test
