feat(search): add local embedding provider for on-premise semantic search (Ollama)#17201
shirshanka wants to merge 15 commits into `master`
Conversation
…arch

Adds a `local` embedding provider that calls any OpenAI-compatible embedding server (primarily Ollama) without cloud API keys, enabling fully on-premise semantic search deployments.

## Changes

### Java backend
- `LocalEmbeddingProvider`: HTTP client using `java.net.http.HttpClient` to call Ollama's `/v1/embeddings` endpoint; `ConnectException` gives an actionable "ollama serve" hint; 404 gives an "ollama pull <model>" hint
- `EmbeddingProviderConfiguration`: added `LocalConfig` with `endpoint` and `model` fields defaulting to Ollama + nomic-embed-text
- `EmbeddingProviderFactory`: added `case "local"` dispatch
- `application.yaml`: added `local:` config block and `nomic_embed_text` (768d) model entry under `embeddingProvider.models`

### Docker
- `docker-compose.ollama.yml`: new `ollama` service (port 11434, profiles `debug-ai`/`quickstart-ai`) + `ollama-model-init` one-shot model puller
- `docker-compose.yml`: includes the new Ollama compose file
- `build.gradle`: new `quickstartDebugAi` Gradle task activating both the `debug` and `debug-ai` profiles simultaneously

### Python ingestion
- `chunking_config.py`: added `"local"` to the provider enum, an `endpoint` field, and `from_server()` support for reading the `LOCAL_EMBEDDING_ENDPOINT` env var
- `chunking_source.py`: local provider branch using litellm `openai/<model>` routing with `api_base` derived from the endpoint and `api_key="local"`

### Dev tooling
- `datahub_dev.sh` / `datahub_dev.py`: `--ai` flag sets the 5 required env vars (provider type, semantic search flags, Ollama endpoint/model) and activates the `quickstartDebugAi` Gradle task

### Tests
- `LocalEmbeddingProviderTest.java`: 12 unit tests covering success, retries, the `ConnectException` hint, the 404 hint, and invalid JSON
- `test_local_embedding_provider.py`: 4 smoke tests verifying Ollama reachability, embedding key/dimension correctness, and semantic ranking

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
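For readers unfamiliar with litellm's OpenAI-compatible routing, here is a minimal sketch of what the `chunking_source.py` branch described above does (illustrative, not the PR's literal code; the default endpoint matches the one in this PR):

```python
import os

import litellm

# The endpoint is the full embeddings URL (Ollama's default below);
# litellm's "openai/" route wants the base URL without the /embeddings path.
endpoint = os.environ.get(
    "LOCAL_EMBEDDING_ENDPOINT", "http://localhost:11434/v1/embeddings"
)
api_base = (
    endpoint[: -len("/embeddings")] if endpoint.endswith("/embeddings") else endpoint
)

response = litellm.embedding(
    model="openai/nomic-embed-text",  # "openai/" selects the OpenAI-compatible client
    input=["Which table holds customer orders?"],
    api_base=api_base,
    api_key="local",  # Ollama ignores the key, but the client requires one
)
vector = response.data[0]["embedding"]  # 768 floats for nomic-embed-text
```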
- Split LocalEmbeddingProvider timeout into CONNECT_TIMEOUT=10s and REQUEST_TIMEOUT=120s; cold GGUF model loading can take 60s+, so a longer read timeout is needed while the connect timeout stays short
- Fix ConnectException detection: HttpClient sometimes wraps it inside IOException; add a getCause() check so the helpful "ollama serve" hint fires in both cases; extract to a newConnectError() helper
- Add testWrappedConnectException and testIoExceptionRetryExhausted tests (15 total, was 12)
- Fix _validate_provider_config: add a 'local' branch so test_connection correctly reports capability instead of always returning False
- Fix SemanticContent model_key: use model_embedding_key from the server config when available (authoritative); fall back to derivation only when it is not set
- Extract a _LOCAL_EMBEDDING_DEFAULT_ENDPOINT constant in chunking_config.py to keep the Java and Python defaults in sync
- ollama-model-init: add a warmup embedding request after the model pull so the GGUF is loaded into memory before the container exits; add restart: "no"
- application.yaml: make the nomic_embed_text vectorDimension configurable via the LOCAL_EMBEDDING_VECTOR_DIMENSION env var
- datahub_dev.py: add a --no-ai flag to clear AI env vars; add --embeddings-endpoint (BYO server, skips the Ollama container) and --embeddings-model flags; add a _wait_for_ollama_model_ready() probe so 'start --ai' blocks until the model is loaded and the first search query is warm; future AI capabilities (chat etc.) can add --chat-endpoint

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
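The `_wait_for_ollama_model_ready()` helper named above could look roughly like this (the name comes from the commit; the body is an assumed sketch, not the PR's code):

```python
import json
import time
import urllib.request


def wait_for_ollama_model_ready(
    endpoint: str = "http://localhost:11434/v1/embeddings",
    model: str = "nomic-embed-text",
    retries: int = 60,
    delay: float = 10.0,
) -> bool:
    """Send real embedding requests until the model answers, so the first
    user-facing search query hits an already-warm model."""
    payload = json.dumps({"model": model, "input": "warmup"}).encode()
    for _ in range(retries):
        try:
            req = urllib.request.Request(
                endpoint, data=payload, headers={"Content-Type": "application/json"}
            )
            with urllib.request.urlopen(req, timeout=30) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # server not up yet, or model still loading
        time.sleep(delay)
    return False
```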
Linear: PFP-3533
Codecov Report: ❌ patch coverage is below the 75% threshold.
New workflow docker-quickstart-ai.yml:

- Triggers on changes to LocalEmbeddingProvider, EmbeddingProviderFactory, application.yaml, docker-compose.ollama.yml, chunking_*.py, and the semantic smoke tests — plus a nightly schedule and workflow_dispatch
- Builds DataHub quickstart images from source (:docker:buildImagesQuickstart)
- Starts docker compose with both the quickstart-consumers and quickstart-ai profiles simultaneously (GMS + core services + Ollama)
- Injects AI env vars into GMS via a DATAHUB_LOCAL_COMMON_ENV temp file (EMBEDDING_PROVIDER_TYPE=local, semantic search flags, Ollama endpoint)
- Waits for the Ollama model to be fully loaded via a host-side probe loop (40 retries × 10s = up to 400s) before running tests
- Runs tests/semantic/test_local_embedding_provider.py with LOCAL_EMBEDDING_PROVIDER_TESTS=true
- Supports an --embedding-model dispatch input to test non-default models
- Uploads JUnit XML results (30d retention) and Docker logs on failure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ollama-model-init exits with code 0 after pulling and warming the model, but `docker compose --wait` treats any container exit as failure, which breaks the startup step even on success. Replace with explicit health polling:

- GMS readiness: curl localhost:8080/health (90 × 10s = 900s max)
- Ollama model readiness: embed probe (60 × 10s = 600s max)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
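Sketched in Python rather than the workflow's shell (an assumed shape, reusing the probe sketch shown earlier), the two gates amount to:

```python
import time
import urllib.request


def wait_for_gms(
    url: str = "http://localhost:8080/health",
    retries: int = 90,  # 90 x 10s = 900s budget, matching the workflow
    delay: float = 10.0,
) -> bool:
    for _ in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass
        time.sleep(delay)
    return False


# Startup gate: GMS first, then the Ollama model (60 x 10s budget).
assert wait_for_gms()
# assert wait_for_ollama_model_ready(retries=60)  # from the earlier sketch
```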
- Pin pytest-timeout to an exact version to satisfy the dep-pinning CI check
- Fix mypy error: the raw_endpoint or-chain now ends on a str literal, so mypy correctly infers str (not str | None) for the slice/endswith ops

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
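For context, the narrowing fix works like this (a minimal illustration; `config_endpoint` is a stand-in name, while the constant comes from an earlier commit):

```python
import os
from typing import Optional

_LOCAL_EMBEDDING_DEFAULT_ENDPOINT = "http://localhost:11434/v1/embeddings"

config_endpoint: Optional[str] = None  # would come from the recipe in real code

# Ending the or-chain on a str literal lets mypy infer str, not str | None,
# so the .endswith()/slice operations below type-check.
raw_endpoint = (
    config_endpoint
    or os.environ.get("LOCAL_EMBEDDING_ENDPOINT")
    or _LOCAL_EMBEDDING_DEFAULT_ENDPOINT
)
api_base = (
    raw_endpoint[: -len("/embeddings")]
    if raw_endpoint.endswith("/embeddings")
    else raw_endpoint
)
```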
Cover the local provider branch in chunking_source.py: model prefixing, api_base derivation from config/env/fallback, model_embedding_key normalisation, and provider-config validation. Brings patch coverage above the 75% Codecov threshold. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sitive

PropertiesCollectorConfigurationTest requires all new config properties to be explicitly listed as sensitive or non-sensitive. The LocalConfig endpoint and model fields, plus the nomic_embed_text model index settings, are non-sensitive operational parameters (no secrets).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mypy 1.17.1 (CI version) doesn't narrow the aspect union through hasattr checks. Replace with getattr chains, which are unconditionally safe.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
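A generic illustration of the difference (the aspect classes here are stand-ins, not DataHub's real types):

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ChunkAspect:
    documentId: str


@dataclass
class OtherAspect:
    name: str


Aspect = Union[ChunkAspect, OtherAspect]


def doc_id_via_hasattr(aspect: Aspect) -> Optional[str]:
    # mypy 1.17.1 does not narrow the union through hasattr(), so this
    # access is flagged as a union-attr error even though it is guarded.
    if hasattr(aspect, "documentId"):
        return aspect.documentId  # mypy error under 1.17.1
    return None


def doc_id_via_getattr(aspect: Aspect) -> Optional[str]:
    # getattr with a default needs no narrowing and is always type-safe.
    return getattr(aspect, "documentId", None)
```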
```python
# Use litellm.embedding() which works with Bedrock, Cohere, OpenAI, and
# local OpenAI-compatible servers.
response = litellm.embedding(
```
unrelated to this pr, lately we've seen so many security issues related to litellm, perhaps we should consider alternatives
```java
HttpRequest request =
    HttpRequest.newBuilder()
        .uri(URI.create(endpoint))
        .timeout(REQUEST_TIMEOUT)
```
Is the request timeout here supposed to save us from the case where the server is malfunctioning? Should we make it larger? 120 seconds seems too aggressive.
Nice work — the feature is well-scoped, defaults are unchanged for existing deployments, and CI (including the new dedicated AI smoke workflow) is green. Approving with two real bugs and a few smaller notes the author should look at before merge.
High-priority issues
1. test_embedding_capability doesn't pass api_base for the local provider — connection test routes to api.openai.com
File: metadata-ingestion/src/datahub/ingestion/source/unstructured/chunking_source.py:1212
`_validate_provider_config` returns `embedding_model = "openai/<model>"` for `provider="local"` (lines 1119-1121). But `test_embedding_capability` then calls `litellm.embedding(model=embedding_model, …)` without an `api_base`:
```python
response = litellm.embedding(
    model=embedding_model,
    input=[test_text],
    api_key=embedding_config.api_key.get_secret_value() if embedding_config.api_key else None,
    aws_region_name=embedding_config.aws_region,
)
```

litellm interprets `openai/<model>` as a real OpenAI call and routes to https://api.openai.com/v1/embeddings. With no real key, the test reports the local provider as not capable even when the local server is healthy.
This is reachable in production: notion_source.py:1048 and confluence_source.py:1121 both call `DocumentChunkingSource.test_embedding_capability(resolved_config)` during `test_connection`, and `resolved_config` can resolve to `provider="local"` via `EmbeddingConfig.from_server` when the server is configured for local embeddings.
Fix: Mirror `_generate_embeddings` — when `embedding_config.provider == "local"`, derive `api_base` from `embedding_config.endpoint` (or `LOCAL_EMBEDDING_ENDPOINT`), strip `/embeddings`, and pass `api_key="local"`. A small `_resolve_local_litellm_kwargs(embedding_config)` helper would let both call sites share the logic. Add a unit test mirroring the existing `test_local_provider_api_base_*` tests.
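A sketch of that helper (the name is this review's suggestion; the body is illustrative, not code from the PR):

```python
import os
from typing import Any, Dict

_LOCAL_EMBEDDING_DEFAULT_ENDPOINT = "http://localhost:11434/v1/embeddings"


def _resolve_local_litellm_kwargs(embedding_config: Any) -> Dict[str, Any]:
    """Shared by _generate_embeddings and test_embedding_capability so the
    connection test hits the same server as the runtime path."""
    endpoint = (
        getattr(embedding_config, "endpoint", None)
        or os.environ.get("LOCAL_EMBEDDING_ENDPOINT")
        or _LOCAL_EMBEDDING_DEFAULT_ENDPOINT
    )
    # litellm's openai/ route wants the base URL without the /embeddings path.
    api_base = (
        endpoint[: -len("/embeddings")] if endpoint.endswith("/embeddings") else endpoint
    )
    return {"api_base": api_base, "api_key": "local"}
```

`test_embedding_capability` would then pass `**_resolve_local_litellm_kwargs(embedding_config)` whenever `embedding_config.provider == "local"`.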
2. Smoke test queries a non-existent GraphQL field and silently skips
File: smoke-test/tests/semantic/test_local_embedding_provider.py:179-222
The test queries appConfig.semanticSearchConfig.embeddingProviderType, but embeddingProviderType does not exist in app.semantic.graphql. SemanticSearchConfig exposes embeddingConfig.provider, not embeddingProviderType. Grep confirms this symbol is referenced only in this test file.
GraphQL validation rejects the query → `result["errors"]` is populated → the `if "errors" in result: pytest.skip(...)` branch fires → the `assert provider_type == "local"` line is never reached. So the test silently always skips, regardless of how the server is actually configured.
Fix: Either add the field to the GraphQL schema (and AppConfigResolver), or change the query to use the existing field `semanticSearchConfig { embeddingConfig { provider } }` and assert on `provider == "local"`.
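Concretely, the corrected assertion could look like this (the `execute_graphql` helper is an assumed fixture name, not necessarily the test's actual one):

```python
APP_CONFIG_QUERY = """
query {
  appConfig {
    semanticSearchConfig {
      embeddingConfig {
        provider
      }
    }
  }
}
"""


def test_local_provider_configured(execute_graphql):
    result = execute_graphql(APP_CONFIG_QUERY)
    # Fail loudly on schema errors instead of silently skipping.
    assert "errors" not in result, result["errors"]
    config = result["data"]["appConfig"]["semanticSearchConfig"]
    assert config["embeddingConfig"]["provider"] == "local"
```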
Other notes
- [medium / correctness] `chunking_source.py:200` — At runtime, the local branch unconditionally prepends `openai/` to the model name, but `_validate_provider_config` (line 1120) skips the prefix if it's already present. The OpenAI branch (line 185) handles this consistently. Result: a user who sets `model: "openai/nomic-embed-text"` gets a different model string between validation (`openai/nomic-embed-text`) and runtime (`openai/openai/nomic-embed-text`). The test `test_local_provider_already_prefixed_model` locks in the buggy behavior, but its docstring says the opposite. Apply the same `if not model_name.startswith("openai/")` guard at runtime (see the sketch after this list).
- [medium / operability] `docker/profiles/docker-compose.ollama.yml:53` — `ollama-model-init` shells out to `curl` for warmup, but the `ollama/ollama` base image isn't documented to ship `curl`. CI works around this by polling from the runner, but local `quickstartDebugAi` users may see the init container marked as failed even when everything works. Consider using `ollama run` for warmup, or installing curl explicitly.
- [low / dx] `LocalEmbeddingProvider.java:157` — `Authorization: Bearer local` is hardcoded. Fine for Ollama, but a user putting an authenticated proxy in front has no way to override it. Consider exposing an optional `token` field on `LocalConfig` (default empty → current behavior).
- [low / tests] `LocalEmbeddingProviderTest.java` — `testNullTextThrows`, `testDefaultConstructor`, and `testTwoArgConstructor` test the `@Nonnull` framework contract / Lombok-style construction. Per AGENTS.md testing conventions these are anti-pattern tests and can be removed.
- What's missing: a `docs/how/` page for the on-prem semantic search flow with Ollama — the PR description content is great and belongs in the docs site.
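The guard from the first note above, sketched (attribute names are assumed to mirror the validation path, which the runtime branch would then match):

```python
# Runtime branch for provider == "local": only prepend the litellm routing
# prefix when it is not already present, matching _validate_provider_config.
model_name = embedding_config.model or "nomic-embed-text"
if not model_name.startswith("openai/"):
    model_name = f"openai/{model_name}"
```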
Questions
- Was `embeddingProviderType` intended as a GraphQL schema addition? If so, it's missing; if not, the smoke test should query `embeddingConfig.provider`.
- Why does runtime `__init__` not strip an existing `openai/` prefix for `local`, while `_validate_provider_config` does? Intentional or oversight?
- Did you verify `curl` is present in `ollama/ollama:latest`? CI doesn't depend on it.
Approving — runtime path is solid, default config is unchanged, and the broken test_connection path / silently-skipping smoke test should be quick fixes after merge.
Your PR has been assigned to @shirshanka (shirshanka) for review (PFP-3533).
Summary
Adds a `local` embedding provider so DataHub can run fully on-premise semantic search without cloud API keys. The provider calls any locally-running OpenAI-compatible embeddings server — primarily Ollama — using `java.net.http.HttpClient` and Jackson (zero new dependencies).

- `LocalEmbeddingProvider` implements `EmbeddingProvider` via HTTP to `{endpoint}/v1/embeddings`
- `docker-compose.ollama.yml` with `ollama` service + `ollama-model-init` one-shot container that pulls and warms up the model on first start
- `quickstartDebugAi` activates both `debug` and `debug-ai` profiles so Ollama starts alongside GMS
- `chunking_source.py` / `chunking_config.py` support `provider=local` via litellm's OpenAI-compatible routing
- `datahub-dev.sh start --ai` spins up the full managed stack and blocks until Ollama's model is warm; `--embeddings-endpoint` allows BYO server; `--no-ai` clears the env vars

Default model:
`nomic-embed-text` (768 dimensions) — best OSS quality/speed balance available on Ollama.

How it works

GMS's `LocalEmbeddingProvider` POSTs the text to `{endpoint}/v1/embeddings` on the locally-running server and reads the embedding vector back; no request ever leaves the machine.
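The wire format is the standard OpenAI-compatible embeddings contract that Ollama implements; a sketch with illustrative values:

```python
# POST {endpoint}, e.g. http://localhost:11434/v1/embeddings
request_body = {
    "model": "nomic-embed-text",
    "input": ["Which table holds customer orders?"],
}

# Successful response (truncated): one 768-dim vector per input string.
response_body = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.0123, -0.0456]},  # ...768 floats
    ],
    "model": "nomic-embed-text",
}
```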
Quick start (managed)

Run `scripts/dev/datahub-dev.sh start --ai`: Ollama starts, the model is pulled and warmed, and the command blocks until GMS is healthy and the first search query is warm.
BYO existing server

Pass `--embeddings-endpoint` (and optionally `--embeddings-model`) to point at an already-running OpenAI-compatible server; the Ollama container is then skipped.
Configuration (env vars)
| Env var | Value |
| --- | --- |
| `EMBEDDING_PROVIDER_TYPE` | `local` (default: `openai`) |
| `LOCAL_EMBEDDING_ENDPOINT` | `http://localhost:11434/v1/embeddings` |
| `LOCAL_EMBEDDING_MODEL` | `nomic-embed-text` |
| `LOCAL_EMBEDDING_VECTOR_DIMENSION` | `768` |
"Cannot connect to … Is it running? Start Ollama with: ollama serve""Model not pulled? Try: ollama pull nomic-embed-text"Files changed
Files changed

- `metadata-io/.../LocalEmbeddingProvider.java`
- `metadata-io/.../LocalEmbeddingProviderTest.java`
- `metadata-service/.../EmbeddingProviderConfiguration.java` — `LocalConfig` nested class
- `metadata-service/.../EmbeddingProviderFactory.java` — `case "local"`
- `metadata-service/.../application.yaml` — `local:` block + `nomic_embed_text` model entry
- `docker/profiles/docker-compose.ollama.yml`
- `docker/profiles/docker-compose.yml`
- `docker/build.gradle` — `quickstartDebugAi` task (debug + debug-ai profiles)
- `metadata-ingestion/.../chunking_config.py` — `local` provider support, endpoint field, constant
- `metadata-ingestion/.../chunking_source.py`
- `scripts/dev/datahub_dev.py` — `--ai`, `--no-ai`, `--embeddings-endpoint`, `--embeddings-model`, model-ready wait
- `smoke-test/tests/semantic/test_local_embedding_provider.py` (`LOCAL_EMBEDDING_PROVIDER_TESTS=true`)
Test plan

- `./gradlew :metadata-io:test --tests "*LocalEmbeddingProvider*"` — 15 unit tests pass
- `scripts/dev/datahub-dev.sh start --ai` — Ollama starts, model pulled and warmed, GMS healthy
- Smoke tests (`LOCAL_EMBEDDING_PROVIDER_TESTS=true`): `nomic_embed_text` key with 768-dim vectors confirmed ✓
- `./gradlew :metadata-ingestion:lintFix` — clean

🤖 Generated with Claude Code