notebooklm-graph-pipe extends upstream llm-graph-builder with agent skills and workflows that combine NotebookLM and Neo4j across Codex, Claude, and OpenCode. It provides an end-to-end pipeline for turning a NotebookLM-backed corpus into a Neo4j graph, a self-improving graph consolidation workflow, and A/B evaluation of notebook-only retrieval vs hybrid vector RAG + GraphRAG.
```mermaid
flowchart LR
    A["Local Corpus"] -- "sync_notebook_graph.py" --> B["NotebookLM"]
    B -- "export" --> C["Staged .txt Files"]
    C -- "build_graph.py" --> D["Neo4j Graph"]
    D -- "postprocess_graph.py" --> E["Post-processed Graph"]
    E -- "run_ab_evaluation.py" --> F["A/B Evaluation Report"]
    E -- "consolidate_self_improving.py" --> G["Consolidated Graph"]
```
- Python 3.12+
- Google account signed into NotebookLM
- `notebooklm-mcp` and `neo4j` MCP servers configured for the bundled NotebookLM and graph workflows
- Docker if you want `scripts/sync_notebook_graph.py` to provision or resume a managed Neo4j container automatically, or your own Neo4j instance if you want to pass explicit `--neo4j-*` connection details
- One of the supported agents on `PATH`: `codex`, `claude`, or `opencode`
```bash
git clone --recurse-submodules <repo-url>
cd llm-graph-builder-scripts
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt -c vendor/llm-graph-builder/backend/constraints.txt
```

Authenticate with NotebookLM:

```bash
nlm login
```

Set provider keys only for the flows that use them:
```bash
# Optional - required for the default Google-backed postprocess/consolidation path,
# or whenever your routing config selects Google providers
export GOOGLE_API_KEY="your-key-here"

# Optional - only if your routing config selects these providers
export OPENAI_API_KEY="..."
export OPENROUTER_API_KEY="..."
```

You do not need to provide Neo4j connection details to `scripts/sync_notebook_graph.py` by default. If you omit `--neo4j-uri`, `--neo4j-user`, `--neo4j-password`, and `--neo4j-database`, the sync workflow provisions or resumes a Docker-managed Neo4j runtime for you automatically.
Only pass explicit Neo4j flags when you want the scripts to target a Neo4j instance that you manage yourself:
```bash
python scripts/sync_notebook_graph.py create \
  --dataset-dir path/to/corpus \
  --notebook-title my-corpus \
  --neo4j-uri bolt://127.0.0.1:7687 \
  --neo4j-user neo4j \
  --neo4j-password your-password \
  --neo4j-database neo4j
```

Note: the NotebookLM standard tier allows up to 100 notebooks and 50 sources per notebook. If source-count limits become a bottleneck, upgrade through NotebookLM or the Google AI plans.
Docker notes:

- `scripts/sync_notebook_graph.py` can run without Docker when you provide explicit Neo4j credentials for your own instance
- `scripts/run_ab_evaluation.py --manifest-path ...` does not require Docker
- `scripts/run_ab_evaluation.py --datasets ...` auto-manages configured containers only when Docker is available
Assuming you have completed the Setup, the shortest path is to let scripts/sync_notebook_graph.py manage the Neo4j container automatically:
```bash
python scripts/sync_notebook_graph.py create \
  --dataset-dir path/to/your/corpus \
  --notebook-title my-corpus

python scripts/run_ab_evaluation.py \
  --manifest-path data/notebooklm_exports/my-corpus/manifest.json
```

This writes `data/notebooklm_exports/my-corpus/manifest.json`, stages NotebookLM exports under `sources/`, builds the graph, and runs the 4-factor A/B evaluation.
`scripts/build_graph.py` consumes staged NotebookLM-exported `.txt` files. When you run `scripts/sync_notebook_graph.py`, it bridges a local corpus to that staged format in this order:
- it walks the files under `--dataset-dir`
- it uploads those files into NotebookLM
- it exports NotebookLM source content into `data/notebooklm_exports/<project_slug>/sources/*.txt`
- it writes `manifest.json` with the notebook id and Neo4j runtime
- if you do not pass explicit `--neo4j-*` flags, it provisions or resumes a Docker-managed Neo4j runtime
- unless you pass `--skip-build`, it runs graph extraction from the staged `sources/` directory
- unless you pass `--skip-postprocess`, it runs the post-processing tail after graph extraction
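The ordering above can be sketched as a small planning function. The step names are hypothetical labels for illustration, not the script's actual internals:

```python
def plan_sync_steps(explicit_neo4j: bool, skip_build: bool, skip_postprocess: bool) -> list[str]:
    """Return the create-workflow steps in execution order (hypothetical names)."""
    steps = [
        "walk_dataset_dir",
        "upload_to_notebooklm",
        "export_staged_sources",
        "write_manifest",
    ]
    if not explicit_neo4j:
        # No --neo4j-* flags: a Docker-managed runtime is provisioned or resumed.
        steps.append("provision_or_resume_managed_neo4j")
    if not skip_build:
        steps.append("run_graph_extraction")
        if not skip_postprocess:
            # The post-processing tail only runs after a graph build.
            steps.append("run_postprocess_tail")
    return steps
```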
Use local files that NotebookLM can ingest. The graph build itself always runs from the staged `.txt` exports.
```bash
python scripts/sync_notebook_graph.py create \
  --dataset-dir path/to/corpus \
  --notebook-title my-corpus
```

Add `--skip-build` to stop after NotebookLM sync and manifest creation. Add `--skip-postprocess` to skip the post-processing tail after graph extraction. Add explicit `--neo4j-*` flags only if you want to use your own Neo4j instance instead of the managed Docker runtime.
```bash
python scripts/sync_notebook_graph.py update \
  --dataset-dir path/to/corpus \
  --notebook-id 12345678-1234-1234-1234-123456789abc
```

Explicit Neo4j flags on `update` override any managed Neo4j runtime recorded in the manifest.
```bash
python scripts/build_graph.py \
  --sources-dir ./data/notebooklm_exports/my-corpus/sources \
  --neo4j-uri bolt://127.0.0.1:7687 \
  --neo4j-user neo4j \
  --neo4j-password your-password \
  --neo4j-database neo4j
```

```bash
python scripts/postprocess_graph.py \
  --neo4j-uri bolt://127.0.0.1:7687 \
  --neo4j-user neo4j \
  --neo4j-password your-password \
  --neo4j-database neo4j
```

These direct graph and postprocess entrypoints target a Neo4j instance explicitly, so pass `--neo4j-*` for the server you want to use.
```bash
python scripts/run_ab_evaluation.py \
  --manifest-path ./data/notebooklm_exports/my-corpus/manifest.json
```

Manifest-driven evaluation loads the notebook and Neo4j runtime from the manifest, generates 8 primary questions plus 2 reserves, runs `notebook_only` and `hybrid`, and scores them on correctness, completeness, evidence quality, and cross-document synthesis.
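The 4-factor comparison can be illustrated with a small aggregation sketch. The equal weighting and 0-1 scale here are assumptions; the evaluator's actual rubric may differ:

```python
from statistics import mean

# The four evaluation factors named above.
FACTORS = ("correctness", "completeness", "evidence_quality", "cross_document_synthesis")


def score_arm(per_question_scores: list[dict[str, float]]) -> float:
    """Average the four factor scores across all questions for one arm."""
    return mean(mean(q[f] for f in FACTORS) for q in per_question_scores)


# Toy scores for a single question in each arm.
notebook_only = [{"correctness": 0.8, "completeness": 0.6,
                  "evidence_quality": 0.7, "cross_document_synthesis": 0.5}]
hybrid = [{"correctness": 0.8, "completeness": 0.7,
           "evidence_quality": 0.8, "cross_document_synthesis": 0.9}]

winner = "hybrid" if score_arm(hybrid) > score_arm(notebook_only) else "notebook_only"
```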
If you want to supply your own questions:
```bash
python scripts/run_ab_evaluation.py \
  --manifest-path ./data/notebooklm_exports/my-corpus/manifest.json \
  --questions-file path/to/questions.json \
  --dataset-label my-corpus
```

To run the self-improving consolidation workflow:

```bash
python scripts/consolidation/consolidate_self_improving.py
```

Tier 1 handles lexical merges first. Tier 2 and Tier 3 then run in a self-improving loop until the consolidation gate passes or the iteration budget is exhausted.
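The tiered loop can be sketched as follows. The `semantic_merge` and `gate` callables stand in for the real Tier 2/3 logic and consolidation gate, which are not shown here:

```python
def consolidate(entities, semantic_merge, gate, budget):
    """Tier 1 lexical merge, then iterate tiers 2-3 until the gate passes or the budget runs out."""
    merged = sorted({e.strip().lower() for e in entities})  # Tier 1: lexical merges
    for iteration in range(budget):
        if gate(merged):
            return merged, iteration  # gate passed after `iteration` loop passes
        merged = semantic_merge(merged)  # stand-in for the Tier 2 + Tier 3 pass
    return merged, budget  # iteration budget exhausted


# Toy run: the lexical pass collapses case/whitespace, the loop collapses spacing variants.
result, iters = consolidate(
    ["GraphRAG", "graphrag ", "Graph RAG"],
    semantic_merge=lambda xs: sorted({x.replace(" ", "") for x in xs}),
    gate=lambda xs: len(xs) <= 1,
    budget=3,
)
```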
The default local graph-build embedding is `sentence-transformer` with `all-MiniLM-L6-v2`. Routing config can switch embedding, prompt, and judge roles across these providers:
| Provider or runtime | Env variables | When required | Python dependency |
|---|---|---|---|
| `genai` / `gemini` | `GOOGLE_API_KEY` | Whenever `scripts/postprocess_graph.py` or the default consolidation flow uses Google-backed prompt / judge / embedding roles, or whenever `--llm-routing-config` selects Google-backed roles | `google-genai`, `langchain-google-vertexai` |
| `openai` | `OPENAI_API_KEY` | Whenever the routing config selects OpenAI for embeddings or single-prompt roles | `openai`, `langchain-openai` |
| `openrouter` | `OPENROUTER_API_KEY` | Whenever the routing config selects OpenRouter for embeddings or single-prompt roles | `openai`, `langchain-openai` |
| `sentence-transformer` | None | Default local graph-build embeddings, or whenever local embeddings are selected explicitly | `sentence-transformers`, `langchain-huggingface` |
Without `--llm-routing-config`, `scripts/postprocess_graph.py` and the default consolidation flow use Google-backed prompt, judge, and embedding roles, so those paths require `GOOGLE_API_KEY`. The main sync, graph-build, and A/B evaluation flow does not require it by default.
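A preflight check derived from the table above can catch missing keys before a run. The helper below is a sketch for illustration, not part of the repo's scripts:

```python
# Env vars each provider needs, per the routing table above.
REQUIRED_KEYS = {
    "genai": ("GOOGLE_API_KEY",),
    "gemini": ("GOOGLE_API_KEY",),
    "openai": ("OPENAI_API_KEY",),
    "openrouter": ("OPENROUTER_API_KEY",),
    "sentence-transformer": (),  # fully local, no key needed
}


def missing_keys(selected_providers, env):
    """Return the env vars a routing selection needs but that are unset or empty."""
    needed = {k for p in selected_providers for k in REQUIRED_KEYS.get(p, ())}
    return sorted(k for k in needed if not env.get(k))
```

In practice you would pass `os.environ` as `env`; taking it as a parameter keeps the sketch testable.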
Supported agent runtimes for review or taxonomy-tail steps are `codex`, `claude`, and `opencode`. Without a routing config, consolidation defaults to `codex`.

The bundled notebooklm-neo4j-deep-research workflow is packaged for `.claude`, `.opencode`, and `.codex`. It alternates between NotebookLM answers and Neo4j neighborhood expansion, keeps only the strongest branches, and stops when additional loops stop adding signal.
- `notebooklm-mcp`: notebook querying and NotebookLM source access
- `neo4j`: schema reads and Cypher exploration
The bundled deep-research agent packages depend on both MCP servers:
- `.codex/skills/notebooklm-neo4j-deep-research/`
- `.claude/agents/notebooklm-neo4j-deep-research.md`
- `.opencode/agents/notebooklm-neo4j-deep-research.md`
What the provided skill does:
- treats NotebookLM as the high-context reader and Neo4j as the topology explorer
- starts from a notebook answer, extracts concrete entities, concepts, aliases, and open questions
- expands the strongest seeds through graph neighborhoods, then turns the best graph findings into tighter NotebookLM follow-ups
- scores candidate branches for relevance, novelty, graph support, and explainability, and stops when the loop stops adding signal
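The branch-pruning behaviour can be sketched as a scored beam with a stopping rule. The equal weights and `min_gain` threshold below are assumptions for illustration, not the skill's actual scoring:

```python
def score_branch(relevance, novelty, graph_support, explainability):
    """Equal-weight combination of the four branch criteria (weights are an assumption)."""
    return (relevance + novelty + graph_support + explainability) / 4


def run_loop(candidate_rounds, keep_top=2, min_gain=0.05):
    """Keep the strongest branches each round; stop when a round stops adding signal."""
    best, kept = 0.0, []
    for branches in candidate_rounds:
        scored = sorted((score_branch(*b) for b in branches), reverse=True)[:keep_top]
        if not scored or scored[0] - best < min_gain:
            break  # no meaningful improvement: the loop has stopped adding signal
        best = scored[0]
        kept.append(scored)
    return best, kept


best, kept = run_loop([
    [(0.9, 0.8, 0.7, 0.6), (0.2, 0.2, 0.2, 0.2)],  # iteration 1 candidates
    [(0.9, 0.9, 0.8, 0.9)],                        # iteration 2: still improving
    [(0.9, 0.9, 0.8, 0.85)],                       # iteration 3: no new signal, stop
])
```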
Example use:

```text
Use the bundled notebooklm-neo4j-deep-research skill against the notebook "my-corpus"
and the connected Neo4j graph. Research this question: "Which methods connect graph-based
retrieval with hallucination control in this corpus?" Use a 3-iteration loop budget and
return the full skill output.
```
In practice, that workflow queries NotebookLM for an initial answer, extracts high-signal seeds, probes Neo4j for neighborhoods and bridge concepts, asks targeted NotebookLM follow-ups, and returns a structured report with the final answer, iteration log, accepted/rejected branches, stop reason, and self-critique.
- `vendor/llm-graph-builder/`: upstream `neo4j-labs/llm-graph-builder` submodule
- `src/`: local backend overlay modules that override selected upstream behavior
- `scripts/`: sync, graph build, post-processing, evaluation, and consolidation entrypoints
- `tests/`: regression coverage for orchestration and overlay behavior
- `.claude/`, `.opencode/`, `.codex/`: bundled agent and skill definitions
`src/` overlays `vendor/llm-graph-builder/backend/src`. Put local backend behavior changes in the overlay package, not in the vendored submodule.
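One common way such an overlay takes precedence is import-path ordering; the sketch below assumes that mechanism, which may differ from the repo's actual loader:

```python
from pathlib import Path

repo_root = Path(".").resolve()

# Listing the overlay before the vendored backend means Python's import
# machinery resolves shared module names to the overlay copy first.
search_path = [
    str(repo_root / "src"),
    str(repo_root / "vendor" / "llm-graph-builder" / "backend" / "src"),
]
# An entrypoint would prepend these to sys.path in this order.
```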