Repository Guidelines

Project Layout

src/openaivec/: core batched wrappers (_responses.py, _embeddings.py), batching/caching internals (_cache/proxy.py, _cache/optimize.py, _cache/_backend.py), provider/DI setup (_provider.py, _di.py), schema inference (_schema/), and integrations (pandas_ext/, spark_ext.py, duckdb_ext.py).
src/openaivec/task/: function-style task factories by domain (nlp/, customer_support/, table/) plus registry plumbing in _registry.py.
tests/: mirrors the source layout, including focused suites in tests/_cache/ and tests/_schema/.
docs/ holds the MkDocs sources, site/ holds the generated pages, and artifacts/ holds scratch assets that are kept out of releases.

Architecture Notes

Remote batched execution goes through BatchCache / AsyncBatchCache in openaivec._cache; proxies dedupe inputs in order, require same-length outputs, and release in-flight waiters on failure. The cache layer is pluggable via CacheBackend (default: in-memory OrderedDict); DuckDBCacheBackend provides persistent cross-session caching.
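
The in-order dedupe and same-length contract described above can be sketched in plain Python. This is an illustration of the idea only, not the BatchCache implementation; map_deduplicated and fake_batch are hypothetical names.

```python
from collections.abc import Callable, Hashable, Sequence
from typing import TypeVar

T = TypeVar("T", bound=Hashable)
R = TypeVar("R")


def map_deduplicated(inputs: Sequence[T], batch_fn: Callable[[list[T]], list[R]]) -> list[R]:
    """Call ``batch_fn`` once on the unique inputs and fan the results back out."""
    # dict.fromkeys keeps first-seen order, so unique items stay in input order.
    unique = list(dict.fromkeys(inputs))
    outputs = batch_fn(unique)
    # The batched function must return exactly one output per unique input.
    if len(outputs) != len(unique):
        raise ValueError(f"expected {len(unique)} outputs, got {len(outputs)}")
    lookup = dict(zip(unique, outputs))
    # Rebuild a result list aligned with the original (possibly repeated) inputs.
    return [lookup[item] for item in inputs]


calls: list[list[str]] = []


def fake_batch(items: list[str]) -> list[str]:
    # Hypothetical stand-in for a remote batched call.
    calls.append(items)
    return [item.upper() for item in items]


result = map_deduplicated(["a", "b", "a", "c", "b"], fake_batch)
```

Here the five inputs trigger a single call with three unique items, and result keeps the original length and order.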
batch_size behavior is shared across sync/async proxies: None enables BatchSizeSuggester auto-tuning (target ~30-60s per batch), positive values force fixed chunks, and <= 0 processes all items in one call.
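
The three batch_size modes can be read as a chunking rule. The sketch below assumes a fixed suggested_size as a stand-in for the auto-tuner; iter_batches and suggested_size are hypothetical names, and the real BatchSizeSuggester adapts the size from observed batch durations.

```python
from collections.abc import Iterator, Sequence
from typing import Optional, TypeVar

T = TypeVar("T")


def iter_batches(
    items: Sequence[T],
    batch_size: Optional[int],
    suggested_size: int = 8,
) -> Iterator[list[T]]:
    """Yield chunks of ``items`` following the batch_size convention."""
    if not items:
        return
    if batch_size is None:
        # Auto mode: a real tuner would pick this size; here it is a constant.
        size = max(1, suggested_size)
    elif batch_size > 0:
        # Fixed mode: every chunk has at most batch_size items.
        size = batch_size
    else:
        # batch_size <= 0: send everything in one call.
        size = len(items)
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


fixed = list(iter_batches([0, 1, 2, 3, 4], batch_size=2))
single = list(iter_batches([0, 1, 2, 3, 4], batch_size=0))
auto = list(iter_batches([0, 1, 2, 3, 4], batch_size=None, suggested_size=3))
```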
Progress bars appear only when show_progress=True and the runtime is notebook-like.
_responses.py / _embeddings.py are batched OpenAI wrappers with retry/backoff; structured outputs use Pydantic response_format, and _responses.py retries schema failures with validation feedback (max_validation_retries).
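
Retrying schema failures with validation feedback can be pictured as the loop below. It is a simplified sketch: the library validates against Pydantic models, whereas this version only checks required JSON keys, and parse_with_feedback, call_model, and required_keys are hypothetical names. Only max_validation_retries comes from the text above.

```python
import json
from collections.abc import Callable


def parse_with_feedback(
    call_model: Callable[[str], str],
    prompt: str,
    required_keys: set[str],
    max_validation_retries: int = 2,
) -> dict:
    """Ask for JSON and re-ask with the validation error until it is valid."""
    message = prompt
    last_error = ""
    # One initial attempt plus up to max_validation_retries corrective attempts.
    for _attempt in range(max_validation_retries + 1):
        raw = call_model(message)
        try:
            data = json.loads(raw)  # json.JSONDecodeError subclasses ValueError
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            missing = required_keys - set(data)
            if missing:
                raise ValueError(f"missing keys: {sorted(missing)}")
            return data
        except ValueError as exc:
            last_error = str(exc)
            # Feed the validation error back so the next reply can correct it.
            message = f"{prompt}\n\nPrevious reply was invalid: {last_error}. Return corrected JSON."
    raise ValueError(f"validation failed after {max_validation_retries} retries: {last_error}")


replies = iter(['{"label": "positive"}', '{"label": "positive", "confidence": 0.9}'])
seen: list[str] = []


def fake_model(message: str) -> str:
    # Hypothetical stand-in for the model: first reply is incomplete, second is valid.
    seen.append(message)
    return next(replies)


parsed = parse_with_feedback(fake_model, "Classify the sentiment.", {"label", "confidence"})
```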
The parse helpers infer a schema when response_format=None; pass an explicit model when a deterministic output shape is required.
Reuse caches from *_with_cache helpers (or Spark UDF-local caches) per operation and clear them (clear/aclose) when finished to avoid unbounded cache growth.
duckdb_ext.py provides DuckDB UDF registration (responses_udf, embeddings_udf, task_udf), similarity_search for top-k cosine queries, and pydantic_to_duckdb_ddl for schema-to-DDL conversion. Use DuckDBCacheBackend as the cache field of BatchCache for persistent cross-session caching. DuckDB is a core dependency.
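
For readers unfamiliar with top-k cosine queries, the ranking that such a search computes can be sketched in pure Python. This does not show how similarity_search is implemented or called; cosine_similarity, top_k, and the toy vectors are hypothetical.

```python
import math
from collections.abc import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Treat a zero vector as having no similarity to anything.
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], corpus: dict[str, Sequence[float]], k: int) -> list[tuple[str, float]]:
    """Return the k corpus entries most similar to ``query``, best first."""
    scored = [(key, cosine_similarity(query, vector)) for key, vector in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


corpus = {"cat": [1.0, 0.0], "kitten": [0.9, 0.1], "car": [0.0, 1.0]}
hits = top_k([1.0, 0.0], corpus, k=2)
```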

Task Conventions

Export tasks as factory functions (for example nlp.sentiment_analysis()), not constant task instances.
Each task module should define a TASK_SPEC entry for task._registry, and task response models should reject unknown fields (ConfigDict(extra="forbid")).
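
A response model that rejects unknown fields might look like the following sketch, assuming Pydantic v2. SentimentResult and its fields are hypothetical and are not taken from the package.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class SentimentResult(BaseModel):
    """Hypothetical task response model that forbids extra fields."""

    model_config = ConfigDict(extra="forbid")

    label: str
    confidence: float


# A payload with exactly the declared fields validates.
ok = SentimentResult.model_validate({"label": "positive", "confidence": 0.9})

# An unexpected key is rejected instead of being silently dropped.
try:
    SentimentResult.model_validate({"label": "positive", "confidence": 0.9, "mood": "happy"})
    rejected = False
except ValidationError:
    rejected = True
```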
Use PreparedTask for reusable instruction/schema pairs; it is immutable and intentionally does not store default API kwargs.
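
The factory-function and immutability conventions above can be illustrated with a frozen dataclass. SketchTask and sentiment_sketch are hypothetical stand-ins, not the PreparedTask API; the sketch only shows why a factory can accept per-call options while the returned object stays read-only.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class SketchTask:
    """Hypothetical immutable instruction/schema pair."""

    instructions: str
    response_format: type


def sentiment_sketch(target_language: str = "English") -> SketchTask:
    """Build a fresh task per call, so callers can customize it via arguments."""
    return SketchTask(
        instructions=f"Classify the sentiment. Answer in {target_language}.",
        response_format=dict,
    )


task_en = sentiment_sketch()
task_ja = sentiment_sketch("Japanese")

# Frozen dataclasses raise on attribute assignment, so a task cannot drift after creation.
try:
    setattr(task_en, "instructions", "changed")
    mutated = True
except FrozenInstanceError:
    mutated = False
```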

Build, Test, and Development Commands

uv sync --all-extras --dev prepares extras and tooling; iterate with uv run pytest -m "not slow and not requires_api" before a full uv run pytest.
Run focused suites for touched subsystems when possible (for example uv run pytest tests/_cache tests/_schema).
uv run ruff check . --fix enforces style, uv run pyright guards API changes, and uv build validates the distribution.
Use uv pip install -e . only when external tooling requires an editable install.

Coding Style and Naming Conventions

Target Python 3.10+, rely on absolute imports, and keep helpers private with leading underscores; expose symbols via explicit __all__ (internal modules can keep __all__ = [] unless specific exports are required).
Apply Google-style docstrings with typed Args entries (name (type): description), Returns/Raises sections, double-backtick literals, and doctest-style Example: blocks (>>>) when useful.
Keep sync/async APIs behaviorally aligned (.ai.* vs .aio.*, Batch* vs AsyncBatch*), dataframe accessors descriptive (responses, extract, fillna), and raise narrow exceptions (ValueError, TypeError).

Testing Guidelines

Pytest discovers tests/test_*.py; parametrize to cover pandas vectorization, Spark UDFs, and async pathways.
Use markers consistently: @pytest.mark.requires_api, @pytest.mark.slow, @pytest.mark.spark, @pytest.mark.integration, @pytest.mark.asyncio; skip gracefully when credentials or optional deps are missing.
Add regression tests before fixes, assert on structure/length/order rather than verbatim text, and prefer shared fixtures over heavy mocking.
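
Asserting on structure, length, and order rather than verbatim text could look like the sketch below. fake_responses is a hypothetical stand-in for a batched call whose wording may vary between runs.

```python
def fake_responses(inputs: list[str]) -> list[dict]:
    """Hypothetical stand-in for a batched call; real wording would vary."""
    return [{"input": text, "summary": f"About {text}"} for text in inputs]


def test_batch_preserves_length_and_order() -> None:
    inputs = ["apple", "banana", "cherry"]
    outputs = fake_responses(inputs)

    # Length: one output per input.
    assert len(outputs) == len(inputs)
    # Order: outputs line up with the inputs they came from.
    assert [item["input"] for item in outputs] == inputs
    # Structure: every output has exactly the expected keys.
    assert all(set(item) == {"input", "summary"} for item in outputs)
    # Content is only checked for type and non-emptiness, never exact text.
    assert all(isinstance(item["summary"], str) and item["summary"] for item in outputs)
```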

Commit and Pull Request Guidelines

Commits follow type(scope): summary (e.g., fix(pandas): guard empty batch) and avoid merge commits within feature branches.
Pull requests explain motivation, outline the solution, link issues, list doc updates, and include the latest uv run pytest and uv run ruff check . --fix output; attach screenshots for doc or tutorial changes.
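
The type(scope): summary commit format can be checked mechanically. The pattern below is one possible reading of that convention, not a rule enforced by the repository; COMMIT_PATTERN and parse_commit_subject are hypothetical names.

```python
import re
from typing import Optional

# Lowercase type, parenthesized scope, colon, space, then a non-empty summary.
COMMIT_PATTERN = re.compile(
    r"^(?P<type>[a-z]+)\((?P<scope>[a-z0-9_\-/]+)\): (?P<summary>\S.*)$"
)


def parse_commit_subject(subject: str) -> Optional[dict[str, str]]:
    """Return the parts of a conforming subject line, or None if it does not match."""
    match = COMMIT_PATTERN.match(subject)
    return match.groupdict() if match else None


good = parse_commit_subject("fix(pandas): guard empty batch")
bad = parse_commit_subject("Fixed stuff")
```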

Security and Configuration

Auth precedence is OPENAI_API_KEY first, then Azure (AZURE_OPENAI_BASE_URL + AZURE_OPENAI_API_VERSION, with optional AZURE_OPENAI_API_KEY for API-key auth).
Azure endpoints should end with /openai/v1/ (legacy paths work but emit warnings).
For Spark, call setup / setup_azure before registering UDFs so local and executor environments stay in sync.
Keep local secrets under artifacts/, never commit credentials, and rely on CI-managed secrets when extending automation.
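
The auth precedence and Azure endpoint rule above can be summarized as a small decision function. This is a sketch of the stated order only, not the provider setup code; pick_provider, uses_v1_path, and the example URL are hypothetical.

```python
from collections.abc import Mapping


def pick_provider(env: Mapping[str, str]) -> str:
    """Choose a provider from environment variables using the documented precedence."""
    # OPENAI_API_KEY wins whenever it is set.
    if env.get("OPENAI_API_KEY"):
        return "openai"
    # Azure needs both the base URL and the API version; the API key is optional.
    if env.get("AZURE_OPENAI_BASE_URL") and env.get("AZURE_OPENAI_API_VERSION"):
        return "azure"
    raise ValueError("no OpenAI or Azure OpenAI configuration found")


def uses_v1_path(base_url: str) -> bool:
    """Report whether an Azure endpoint ends with the recommended /openai/v1/ path."""
    return base_url.endswith("/openai/v1/")


both = pick_provider({
    "OPENAI_API_KEY": "placeholder",
    "AZURE_OPENAI_BASE_URL": "https://example.openai.azure.com/openai/v1/",
    "AZURE_OPENAI_API_VERSION": "placeholder",
})
azure_only = pick_provider({
    "AZURE_OPENAI_BASE_URL": "https://example.openai.azure.com/openai/v1/",
    "AZURE_OPENAI_API_VERSION": "placeholder",
})
```

In practice pick_provider(os.environ) would be the typical call; a plain dict is used here so the example does not depend on the local environment.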