Divergence Engines (Part 2): A Technical Framework for Useful Difference

25. October 2025
AI · RAG · GraphRAG · Context Engineering · Retrieval · Memory Systems · Knowledge Graphs · Complexity

Relevance is easy to optimize. Exploration is not. If your retrieval stack only knows how to “match,” it will eventually become a gravity well — everything it returns starts to look like what it returned yesterday.

This article is a blueprint for adding the missing counter-force: a second objective (“reach”), a mode switch (converge ↔ diverge), and a set of divergence primitives (diversity, bridges, contradictions, temporal lift) that you can implement on top of standard vector + GraphRAG architectures. Treat it like a menu: pick one primitive, wire it into retrieval, measure, iterate.

Work in progress
Disclaimer

I’d like to treat this as a working notebook on implementation patterns, not a final specification. If you’re building divergence-capable systems, I’d love to hear what works (and what doesn’t) — consider this an open invitation to contribute patterns, critique approaches, and share implementations.

Part 1 made the case that similarity-first retrieval is a convergent system: it reliably collapses inquiry toward the nearest attractor basin, even when the underlying model is capable of surprise. This part is the practical follow-through — a set of implementation patterns for adding that missing counter-force at the retrieval layer. We’ll start by defining dual objectives (answering vs learning), then look at why graph-based architectures make divergence tractable, and then move through concrete divergence primitives and metrics you can actually ship and iterate on.

This article is part of ongoing research tied to my work on Recurse — a sense‑making substrate for AI work focused on context injection, relationship-based retrieval, and portability across providers. Worthwhile reading for anyone building context graphs, retrieval and agent memory systems, or trying to design “AI assistants” that help inquiry expand rather than just answer faster.

Dual Objectives

If you want retrieval to explore, you can’t only reward “closest match.” You need to explicitly reward a second outcome: results that move the inquiry forward — not just toward an answer, but toward new understanding.

Instead of asking “What best matches the question?” we should ask “What best advances the inquiry?”

Systems that discover require retrieval to optimize for more than similarity. The objective function needs a second term:

$$\max_{\mathcal{S} \subset \mathcal{D},\ |\mathcal{S}|=k} \;\;\underbrace{\text{Rel}(q,\mathcal{S})}_{\text{can we answer?}} \;+\;\lambda\;\underbrace{\text{Reach}(q,\mathcal{S})}_{\text{will we learn?}}$$

Where $\text{Rel}(q,\mathcal{S})$ measures how well the retrieved set $\mathcal{S}$ answers query $q$, and $\text{Reach}(q,\mathcal{S})$ measures how well $\mathcal{S}$ expands understanding.

Reach rewards:

  • Novelty: distance from prior context frames
  • Diversity: de-correlation within the retrieved set
  • Bridge distance: controlled hops across concept boundaries
  • Contradiction: stance polarity vs current frame
  • Temporal lift: recent, not-yet-assimilated signals

Relevance retrieves. Reach expands. And $\lambda$ is your steering knob:

  • $\lambda = 0$: pure precision, convergence only
  • $\lambda = 1$: pure exploration, divergence only
  • $\lambda = 0.3$: balanced inquiry, a typical starting point
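A minimal sketch of what optimizing this combined objective can look like at selection time, assuming L2-normalized embeddings. Here Reach is approximated as de-correlation from the already-selected set; in a fuller implementation it would be replaced or augmented by the novelty, bridge, contradiction, and temporal-lift signals above.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def select_dual_objective(query_emb, doc_ids, doc_embs, k=10, lam=0.3):
    """Greedy selection approximating max Rel(q, S) + lambda * Reach(q, S).

    Reach is approximated as dissimilarity to the already-selected set;
    swap in richer signals (novelty, bridges, contradictions, temporal
    lift) as you implement them.
    """
    selected, remaining = [], list(range(len(doc_ids)))
    while remaining and len(selected) < k:
        def combined(i):
            rel = cosine(query_emb, doc_embs[i])          # can we answer?
            reach = 1.0 - max(                            # will we learn?
                (cosine(doc_embs[i], doc_embs[j]) for j in selected),
                default=1.0,  # first pick is chosen on relevance alone
            )
            return rel + lam * reach
        best = max(remaining, key=combined)
        remaining.remove(best)
        selected.append(best)
    return [doc_ids[i] for i in selected]
```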

In practice, systems should expose this as a mode parameter:

| Mode | Purpose | Optimize for |
| --- | --- | --- |
| converge | answer the question | similarity to query |
| diverge | expand the question | useful difference |

The cycle: Diverge → Converge → Integrate → Repeat. This oscillation distinguishes systems that support inquiry from those that merely retrieve and paraphrase.

Without divergence capability, agent reasoning reduces to semantic autocomplete over documents.

Why Graph-Based Architectures Enable Divergence

Most modern retrieval systems are converging on graph-based architectures, and this is good news for divergence. Graphs naturally support concepts that would be awkward in pure vector search: weak ties, bridge nodes, cross-cluster traversal, even contradictory connections.

In a traditional retrieval setup, you embed documents, build a vector index, and find the nearest neighbors. Diversity means maximizing distance in a high-dimensional manifold — an expensive, imprecise operation.

But in a graph-based system, you’re working with nodes connected by edges that carry semantic relationships. This structural richness gives you handles for divergence that flat similarity search cannot offer:

  • In graph space: Diversity = traverse weak ties, hop clusters, follow typed relationships
  • In embedding space: Diversity = maximize distance in n-dimensional space

The key insight: divergence primitives work differently when you’re operating on graph structures versus flat embedding spaces. Graphs let you express diversity through relationship types: weak ties (low-weight edges), bridges (cross-cluster connections), contradictions (opposing stance edges), and temporal connections (recency-weighted paths).

Typed Edges via LLM-Assisted Construction

Graph construction during ingestion can create typed edges explicitly rather than relying solely on embedding similarity. The standard approach embeds incoming content and connects it to similar nodes based on vector distance. Typed edge construction adds a layer: retrieve relevant context via similarity, then use an LLM to analyze relationships.

The flow:

  1. Perform retrieval based on incoming chunk
  2. LLM analyzes relationship between incoming chunk and retrieved context
  3. Create typed edges (supports, contradicts, bridges-from, extends) based on semantic analysis

This was historically done via NLP-based triplet extraction, which was brittle. As LLM costs have decreased, it’s become feasible to use LLMs during ingestion to classify relationship types and identify cross-domain connections that aren’t obvious from embeddings alone.
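A sketch of what this ingestion step might look like, with the LLM call stubbed out. The `classify_relationship` function, `TypedEdge` dataclass, and edge-type vocabulary below are illustrative placeholders, not a fixed schema.

```python
from dataclasses import dataclass

EDGE_TYPES = ("supports", "contradicts", "bridges-from", "extends", "unrelated")

@dataclass
class TypedEdge:
    source_id: str
    target_id: str
    edge_type: str
    confidence: float

def classify_relationship(chunk_text: str, neighbor_text: str) -> tuple[str, float]:
    """Stand-in for an LLM call returning (edge_type, confidence).

    In practice: a structured-output prompt asking the model to pick one
    of EDGE_TYPES for the pair of passages and report its confidence.
    """
    return "extends", 0.9  # placeholder so the sketch runs without an LLM

def build_typed_edges(chunk_id, chunk_text, retrieved_neighbors, min_conf=0.6):
    """retrieved_neighbors: (node_id, node_text) pairs from similarity retrieval."""
    edges = []
    for node_id, node_text in retrieved_neighbors:
        edge_type, conf = classify_relationship(chunk_text, node_text)
        if edge_type != "unrelated" and conf >= min_conf:
            edges.append(TypedEdge(chunk_id, node_id, edge_type, conf))
    return edges
```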

With typed edges, divergence primitives become more deterministic: associative drift can traverse specifically weak-typed edges, cross-domain bridges can follow labeled relationships, contradictions are explicitly marked rather than inferred.

Graph-based systems enable divergence patterns that are awkward or impossible in pure vector approaches. The infrastructure is already moving in this direction; we just need to use it differently.

Divergence Primitives

Discovery isn’t magic; it has mechanics. If we want systems that explore rather than just retrieve, we need divergence primitives—first-class operations that intentionally counterbalance reduction.

Here’s what divergence primitives look like in practice:

| Primitive | Description |
| --- | --- |
| Exploration mode | Optimize for diversity instead of similarity |
| Associative Drift | Sample weaker semantic ties to reveal adjacent ideas |
| Cross-Domain Bridges | Force hops across concept boundaries |
| Contradiction surfacing | Retrieve stance-based disagreement |
| Negative-space mapping | Return what's missing, not what's present |
| Temporal novelty | Surface recent but unassimilated signals |
| Temporal evolution | Track how priors shift over time |
| Structural adaptation (Write-Back) | Trigger update cascades when new information arrives |

Let’s walk through each primitive with implementation approaches.


Associative Drift: Following Weak Ties

The Problem: Pure similarity search returns the same dense cluster every time. Weakly related but potentially valuable connections never surface.

The Solution: Sample from weaker semantic ties to reveal adjacent ideas. Instead of always taking the top-k most similar items, deliberately include items with moderate similarity that introduce novelty.

In graph terms, what you want are weak ties — connections to adjacent but not identical concepts. These paths lead to genuinely new information. The concept comes from social network analysis, specifically Granovetter’s Strength of Weak Ties. Granovetter observed that people often find jobs through acquaintances (weak ties) rather than close friends (strong ties). Close friends know mostly the same people and information you do. Acquaintances bridge to different social clusters and provide access to information you wouldn’t encounter in your immediate circle.

The same principle applies to knowledge graphs. Items most similar to your query — conceptually nearby and highly relevant — are often redundant. Items with weaker semantic similarity remain related enough to be meaningful but different enough to introduce novelty.

Associative drift operates by explicitly avoiding the most similar items and sampling from weaker connections. In graph terms, this means traversing edges with lower weights, following paths that aren’t obvious from direct similarity.

The practical implementation technique is MMR (Maximal Marginal Relevance), introduced by Carbonell and Goldstein in 1998. MMR scores items by how relevant they are to the query minus how similar they are to items already selected. This creates a greedy but effective diversity filter. The formula is elegantly simple:

$$\text{Score}=\lambda\cdot \mathrm{sim}(\text{item}, \text{query})-(1-\lambda)\cdot \max_{s \in \text{selected}} \mathrm{sim}(\text{item}, s)$$

As you select items, each subsequent choice is penalized for being similar to what you’ve already picked, forcing the system to explore different regions of the semantic space.
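A straightforward MMR implementation following the formula above, assuming a `(n, d)` array of L2-normalized embeddings so that dot products are cosine similarities:

```python
import numpy as np

def mmr(query_emb, doc_embs, doc_ids, k=10, lam=0.35):
    """Maximal Marginal Relevance (Carbonell & Goldstein, 1998).

    Score = lam * sim(doc, query) - (1 - lam) * max sim(doc, selected).
    doc_embs: (n, d) array of L2-normalized embeddings.
    """
    sim_to_query = doc_embs @ query_emb                 # (n,)
    selected: list[int] = []
    candidates = list(range(len(doc_ids)))
    while candidates and len(selected) < k:
        if selected:
            # similarity of each remaining candidate to the selected set
            sim_to_selected = doc_embs[candidates] @ doc_embs[selected].T
            penalty = sim_to_selected.max(axis=1)
        else:
            penalty = np.zeros(len(candidates))
        scores = lam * sim_to_query[candidates] - (1 - lam) * penalty
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return [doc_ids[i] for i in selected]
```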

DPPs (Determinantal Point Processes) offer an alternative approach, modeling diversity as repulsion in embedding space. Items that are too similar repel each other in the selection process. DPPs provide globally optimal diversity rather than MMR’s greedy approach, but they’re computationally expensive and the improvement is often marginal for practical applications. For most systems, MMR provides the right balance of effectiveness and efficiency.

In practice, $\lambda$ values between 0.3 and 0.4 give good results for exploration. The key is maintaining tension between relevance and novelty. Set the parameter too low and you get noise — items that are different but meaningless. Too high and you're back to pure similarity. The optimal range around $\lambda \approx 0.35$ consistently surfaces items that are relevant enough to be useful but different enough to expand understanding.

Performance overhead is real but manageable. MMR adds roughly 2–10x to retrieval time compared to pure top-k, depending on result count and embedding dimensions. In absolute terms, this means 10ms to 50ms for typical queries. Since LLM inference dominates total cost (often 100x retrieval time), this overhead is negligible. The payoff is significant: users iterate less when given richer initial context, which means fewer expensive LLM calls overall.

For research and exploration phases, MMR becomes the default. When users signal they want different perspectives — phrases like “show me alternatives” or “what else should I consider” — divergence mode with $\lambda \approx 0.4$ consistently surfaces valuable alternatives that pure similarity would never find.


Cross-Domain Bridges: Forcing Conceptual Collision

The Problem: Semantic search naturally stays within conceptual neighborhoods. Even with diversity-aware selection, you’re still operating within your current cluster. But breakthrough insights often come from collisions between distant clusters — forcing connections that wouldn’t emerge naturally.

The Solution: Reserve a percentage of retrieval slots for information from semantically distant clusters. Force connections across concept boundaries.

Associative drift enables exploration within conceptual neighborhoods but operates locally. Weak ties connect to adjacent clusters rather than distant regions of the knowledge space. Breakthrough insights often come from unexpected collisions between distant concepts — forced connections that wouldn’t emerge naturally through gradual exploration.

Research on innovation consistently shows this pattern. Nobel laureates are significantly more likely to be polymaths working across disciplines. Highly cited papers are disproportionately interdisciplinary, combining insights from fields that rarely interact. Cities scale superlinearly—double the population and you get more than double the innovation — because collision probability increases with density. When more diverse people and ideas are forced into proximity, unexpected combinations emerge.

The mechanism underlying all these observations is forced conceptual juxtaposition. When you deliberately place distant concepts side by side, you create opportunities for analogical reasoning that wouldn’t emerge from pure similarity. A technique from neuroscience might illuminate a problem in computer architecture. A pattern from evolutionary biology might suggest an approach to market dynamics. These connections aren’t obvious from embeddings alone — they require crossing boundaries.

Clustering algorithms become crucial for divergence. When you build your knowledge graph, clustering algorithms can identify dense regions and sparse bridges between them. The clusters represent distinct conceptual domains — statistics, neuroscience, systems dynamics, each with its own vocabulary and methods. The bridges are what you want to traverse.

In practice, this means maintaining both graph structure and cluster assignments. When a query lands in one cluster — say, “attention mechanisms in transformers” clearly belongs to the machine learning cluster — you identify the query’s primary home and then deliberately reserve a percentage of retrieval slots for nodes from adjacent but distinct clusters. Not random distant nodes, which would just be noise. But nodes connected through bridge edges, meaning there’s some semantic relationship even if it’s weak.

Hierarchical clustering helps here. Agglomerative clustering gives you a tree structure: coarse-grained domains at the top (science, engineering, humanities), finer-grained distinctions as you drill down (within science: physics, chemistry, biology; within biology: molecular, cellular, systems). When retrieving, you can deliberately sample from different branches of this tree. Query about “causal inference” in statistics? Deliberately surface nodes from the systems dynamics cluster, because you’ve identified bridge connections through concepts like “feedback loops” or “temporal dependencies.” These aren’t obvious matches from embeddings, but they represent meaningful cross-domain connections.

The graph structure makes this feasible in ways that pure embeddings cannot. You can run periodic clustering updates as new documents arrive, updating cluster assignments to reflect the evolving knowledge landscape. But the graph edges remain stable, so you can still traverse between clusters even as their definitions shift. This dynamic clustering keeps bridge detection current without requiring complete graph reconstruction. A document might migrate from one cluster to another as its neighborhood changes, but the connections that made it a valuable bridge persist.

The key is calibrating the bridge percentage. Reserve too much (say, 50%) and you dilute relevance to the point where users feel the system isn’t listening to their query. Reserve too little (say, 10%) and bridges become rare enough that users might never encounter them. Around 20–30% seems to be the sweet spot — enough to reliably surface cross-domain connections without overwhelming the results with tangential content.
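A sketch of slot reservation, assuming you maintain a cluster label per node (e.g. from agglomerative clustering over embeddings) and normalized embeddings. For brevity, bridge candidates are simply the best-scoring nodes from other clusters; a fuller version would restrict them to nodes reachable via labeled bridge edges.

```python
import numpy as np

def retrieve_with_bridges(query_emb, nodes, k=10, bridge_ratio=0.25):
    """Reserve a share of result slots for nodes outside the query's home cluster.

    `nodes`: list of dicts {"id", "emb", "cluster"} with normalized embeddings.
    """
    sims = np.array([n["emb"] @ query_emb for n in nodes])
    ranked = np.argsort(-sims)

    # The query's "home" cluster: the cluster of its nearest node.
    home = nodes[ranked[0]]["cluster"]

    n_bridge = max(1, int(round(k * bridge_ratio)))
    n_core = k - n_bridge

    core = [i for i in ranked if nodes[i]["cluster"] == home][:n_core]
    # Bridge slots: best-scoring nodes from *other* clusters, so some
    # semantic relation remains even though a boundary is crossed.
    bridges = [i for i in ranked if nodes[i]["cluster"] != home][:n_bridge]

    picked = core + bridges
    if len(picked) < k:  # fall back to plain ranking if either pool runs short
        seen = set(picked)
        picked += [i for i in ranked if i not in seen][: k - len(picked)]
    return [nodes[i]["id"] for i in picked]
```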

When this works well, it’s transformative. A researcher exploring transformer attention mechanisms suddenly encounters biological attention in visual cortex — a bridge they’d never have crossed without the system forcing that juxtaposition. The biological framing doesn’t directly answer their question, but it reframes how they think about the computational problem. That’s the power of cross-domain bridges: not just answering questions, but reshaping the questions themselves.


Surfacing Contradictions: Making Disagreement Visible

The Problem: Standard retrieval systems hide disagreement. They optimize for consensus, returning information that agrees with itself and with the implied stance of the query. This seems reasonable — why would you want conflicting information?—until you realize that productive inquiry requires friction. You need to see where claims contradict, where evidence conflicts, where different frameworks make opposing predictions.

The Solution: Track stance metadata and explicitly surface contradictory viewpoints. Treat disagreement as a signal, not noise.

The problem runs deeper than just missing alternative viewpoints. When systems only surface agreement, they create an illusion of certainty. Users receive a coherent narrative and assume it represents settled truth, unaware that vigorous debates exist just outside their retrieval window. This is particularly dangerous in fields where knowledge is genuinely contested — medicine, policy, social sciences — where presenting only one side of an argument isn’t just incomplete, it’s misleading.

In graph-based systems, surfacing contradictions means tracking stance metadata on edges or nodes. When you ingest information claiming “X causes Y,” you don’t just embed that text and move on. You extract the claim, identify its polarity (supports/opposes/neutral), and mark it in the graph. If other information later claims “X does not cause Y,” that’s a contradiction edge. During retrieval, you explicitly reserve slots for these contradictions, ensuring that disagreement becomes visible rather than hidden.

Graph structure makes contradictions more powerful than simple opposition pairs. Contradictions aren’t just isolated disagreements between two papers — they can be chains of disagreement, entire subgraphs representing competing frameworks. The Rubin vs Pearl debates about causal inference aren’t just two contradicting papers; they’re argument trees rooted in different philosophical commitments about what causation means and how it should be measured. By representing these as graph structures rather than isolated nodes, you can surface not just the fact of disagreement but its underlying structure.

The implementation challenge is stance detection. The simple version uses pattern matching for negations (“does not cause”, “fails to show”, “contradicts previous findings”) and basic sentiment analysis. This catches obvious disagreements but misses subtle contradictions. The advanced version uses LLM-based stance classification during ingestion, extracting not just whether a document supports or opposes a claim but the specific grounds of agreement or disagreement. This is more expensive but far more accurate, and it enables richer contradiction surfacing.
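A sketch of contradiction slot reservation, assuming stance labels ("supports" / "opposes" / "neutral" relative to the claim behind the query) were attached during ingestion or query-time classification:

```python
def retrieve_with_contradictions(ranked_docs, k=10, contradiction_ratio=0.2):
    """Reserve result slots for documents whose stance opposes the prevailing frame.

    `ranked_docs`: relevance-ordered dicts {"id", "stance"}, where stance is
    "supports", "opposes", or "neutral" relative to the query's claim.
    """
    n_contra = max(1, int(round(k * contradiction_ratio)))
    n_main = k - n_contra

    mainstream = [d for d in ranked_docs if d["stance"] != "opposes"][:n_main]
    contradicting = [d for d in ranked_docs if d["stance"] == "opposes"][:n_contra]

    # If no contradicting material exists, fill with mainstream results
    # rather than padding with noise.
    result = mainstream + contradicting
    if len(result) < k:
        seen = {d["id"] for d in result}
        result += [d for d in ranked_docs if d["id"] not in seen][: k - len(result)]
    return [d["id"] for d in result]
```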

One crucial nuance: don’t surface contradictions blindly. Not all disagreements are equally meaningful. Sometimes apparent contradictions are actually refinements — same claim at different levels of precision, same concept from different perspectives, temporal evolution as understanding improves. The contradiction between “vaccines are safe” and “vaccines have side effects” isn’t a genuine disagreement; it’s a difference in framing and specificity.

Treat contradictions as meaningful signals worth exploring. When you surface a contradiction, you’re essentially asking: “Why does this disagree with the prevailing frame?” Sometimes the answer is that one source is wrong or outdated. Sometimes it reveals fundamental differences in methodology or assumptions. Sometimes it points to genuine uncertainty in the field. All of these are valuable for understanding, but they’re valuable in different ways.

The metric that matters is whether contradictions change user behavior. If users engage with contradictory content — reading it, following up with questions about the disagreement, adjusting their conclusions — then contradictions are productive friction. If contradictions get ignored or dismissed, they might just be noise. Track this engagement over time and adjust your contradiction surfacing accordingly. For some queries and domains, aggressive contradiction surfacing (30–40% of results) leads to better outcomes. For others, a light touch (10–15%) is more appropriate.

When this works, it shifts retrieval from answering questions to mapping intellectual terrain. Instead of “here’s what’s true about X,” the system says “here’s what different sources claim about X, and here’s where they fundamentally disagree.” That shift — from authoritative answers to reasoned exploration of disagreement — is what distinguishes systems that support genuine inquiry from those that merely confirm priors.


Temporal Novelty: Preventing Burial of the New

The Problem: Vector similarity is temporally blind. When you embed documents and measure cosine distance, a paper from 2015 and a paper from 2024 are judged purely on semantic content, with no consideration of when they were created. This seems fair — ideas should be evaluated on merit, not recency — until you realize that it systematically buries new information under established patterns.

The Solution: Boost recently added documents that are semantically offset from the current frame — to prioritize information that is both new and different.

Here’s why: old concepts are well-integrated into the corpus. They’ve been cited repeatedly, referenced from multiple angles, embedded in the vocabulary of the field. New concepts haven’t had time to establish these connections yet. They might use different terminology, reference different prior work, or approach problems from angles that don’t match existing embeddings well. Pure similarity search favors the old and familiar, even when the new and unfamiliar is more relevant to current understanding.

This creates a temporal lag in knowledge systems. New breakthroughs don’t surface in retrieval until they’ve been around long enough to become well-connected. By the time they’re easy to find, they’re no longer new. Users asking about current understanding get answers from last year or the year before, unaware that recent work has shifted the landscape.

This boost isn't just recency bias (which would surface all new documents regardless of relevance) but temporal novelty: new information that introduces concepts the existing corpus hasn't assimilated yet.

The implementation combines two signals: recency and semantic distance. Recency measures how long ago the document was ingested, with a decay function that reduces influence over time. Semantic distance measures how far the document is from the corpus center — documents with unusual vocabulary or concepts that don’t match existing patterns get higher novelty scores. Multiply these together and you get a boost that favors documents that are both recent and semantically distinctive.

The decay window should match your domain. For fast-moving fields like machine learning or medicine, a 30–60 day window makes sense — information older than two months is probably well-integrated already. For slower-moving fields like philosophy or history, extend to 180 days or more. The key is calibrating so that documents get a boost during their “assimilation period”—the time between when they’re added and when they become well-connected to the rest of the corpus.

Track ingestion timestamps at document processing time, not publication dates. What matters for this primitive isn’t when something was published but when it entered your system. A 2020 paper added yesterday is “new” from the system’s perspective, because it hasn’t had time to become well-integrated yet. This distinction matters especially for systems that continuously ingest new sources — arxiv feeds, news aggregators, research databases. The temporal boost helps recent additions compete with established content even before they’ve accumulated the citation network and semantic connections that would make them easy to find through pure similarity.
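A sketch of the boost, assuming normalized embeddings, a precomputed corpus centroid, and ingestion timestamps recorded at processing time. The half-life and mixing weight `beta` are illustrative and should be calibrated per domain:

```python
import time

def temporal_novelty_boost(doc_emb, corpus_centroid, ingested_at,
                           now=None, half_life_days=45.0):
    """Boost = recency decay * semantic distance from the corpus centroid.

    `ingested_at` is a unix timestamp recorded at processing time (not the
    publication date). `half_life_days` should match your domain's pace.
    """
    now = now or time.time()
    age_days = max(0.0, (now - ingested_at) / 86400.0)
    recency = 0.5 ** (age_days / half_life_days)         # exponential decay
    distance = 1.0 - float(doc_emb @ corpus_centroid)    # 0 = typical, 2 = opposite
    return recency * distance

def rerank_with_temporal_novelty(scored_docs, corpus_centroid, beta=0.3):
    """scored_docs: list of dicts {"id", "emb", "ingested_at", "rel"}."""
    return sorted(
        scored_docs,
        key=lambda d: d["rel"] + beta * temporal_novelty_boost(
            d["emb"], corpus_centroid, d["ingested_at"]),
        reverse=True,
    )
```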

When this works well, it prevents the problem of knowledge systems becoming historical archives. Users asking about current understanding get current answers, not consensus from two years ago. Breaking developments surface quickly, even when they use unfamiliar framing or terminology. The system stays fresh rather than ossifying around established concepts.


Entropy Control: Maintaining Variance as a First-Class Metric

The Problem: Without explicit variance management, retrieved sets become increasingly redundant over time. Pure similarity search pulls toward the center of mass — the densest, most well-represented concepts in your corpus. As you select each document, the next selection is biased toward documents similar to what you’ve already chosen. Over iterations, this creates a feedback loop where the result set collapses into a narrow region of semantic space.

The Solution: Measure and maintain Shannon entropy across retrieved sets. When variance drops below threshold, inject diversity.

The mathematical formalization of this problem is entropy. Shannon entropy measures the information content of a distribution. High entropy means high diversity — the distribution is spread across many states. Low entropy means low diversity — the distribution is concentrated in a few states. As retrieval systems iterate without variance control, entropy naturally decays. The first few results might be diverse, but by result 15 or 20, you’re just getting minor variations on the same core concepts.

Entropy control means measuring and maintaining minimum variance across retrieved sets. When entropy drops below a threshold, inject diversity explicitly. This isn’t about replacing relevant results with random ones — it’s about recognizing when your result set has become redundant and taking corrective action before the user is buried in near-duplicates.

The implementation projects high-dimensional embeddings down to 2D or 3D for entropy calculation — you can’t directly histogram 768-dimensional vectors. PCA or UMAP work well for this. Then create a histogram binning the reduced embeddings and compute Shannon entropy over the distribution. If adding the next candidate would drop entropy below your threshold, skip it and try the next candidate instead.
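A sketch of that loop using a PCA-via-SVD projection and a 2D histogram. The bin count, warmup size, and entropy floor are illustrative parameters; note that the maximum achievable entropy for n items is ln(n), so the floor has to be calibrated to the size of the set you're building.

```python
import numpy as np

def shannon_entropy_2d(embs: np.ndarray, bins: int = 8) -> float:
    """Project embeddings to 2D (PCA via SVD), histogram, and return entropy in nats."""
    centered = embs - embs.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # top principal directions
    coords = centered @ vt[:2].T
    hist, _, _ = np.histogram2d(coords[:, 0], coords[:, 1], bins=bins)
    p = hist.flatten() / max(hist.sum(), 1.0)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def select_with_entropy_floor(ranked, k=20, min_entropy=2.0, warmup=10):
    """ranked: relevance-ordered (doc_id, emb) pairs.

    Once past `warmup` items, skip any candidate whose addition would pull
    the set's entropy below `min_entropy`.
    """
    selected = []
    for doc_id, emb in ranked:
        if len(selected) >= warmup:
            trial = np.stack([e for _, e in selected] + [emb])
            if shannon_entropy_2d(trial) < min_entropy:
                continue  # would make the set too redundant; try the next candidate
        selected.append((doc_id, emb))
        if len(selected) == k:
            break
    return [d for d, _ in selected]
```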

This might seem subtle, but the impact is significant. Consider a query about “machine learning optimization methods.” Pure similarity returns: gradient descent, stochastic gradient descent, mini-batch gradient descent, Adam optimizer, RMSprop, AdaGrad — all variations on gradient-based optimization. Entropy control would recognize this clustering and inject something different: genetic algorithms, simulated annealing, or grid search. These aren’t as similar to the query, but they represent genuinely different approaches that expand the user’s understanding of the optimization landscape.

The threshold tuning is domain-specific. For technical queries where precision matters, set lower thresholds (entropy > 2.0) — some redundancy is acceptable because subtle differences between near-matches matter. For exploratory queries where breadth matters, set higher thresholds (entropy > 3.0) — you're willing to sacrifice some relevance for greater diversity. Track entropy across different queries and domains to calibrate these thresholds empirically.

Entropy control becomes especially important in multi-turn conversations where context window fills with previous exchanges. Without variance management, each turn tends to narrow focus further. Entropy control provides a counterpressure, maintaining enough diversity that the conversation doesn’t collapse into a single track. It’s the difference between a conversation that spirals into finer and finer detail about one narrow topic versus one that maintains breadth while still going deep.


Negative Space: Making Absence Visible

The Problem: Systems only show what’s present, never what’s absent. But gaps in knowledge are often more important than what exists. Knowing what you don’t know — what questions go unanswered, what concepts are under-represented, what connections are missing — is crucial for genuine understanding. Yet standard retrieval provides no mechanism for surfacing these absences.

The Solution: Explicitly surface structural absences. What concepts should connect but don’t? What questions go unanswered? What areas are under-represented?

The challenge is that absence is structurally different from presence. You can search for documents about X. You cannot directly search for documents that should exist but don’t, or concepts that should be represented but aren’t. Negative space requires inference: comparing what you have against what you should have, and making the gaps explicit.

One approach is coverage analysis. If you have a domain ontology or taxonomy — a structured map of what concepts exist in a field — you can compare your retrieval results against expected coverage. Query about “machine learning” should probably touch on supervised learning, unsupervised learning, reinforcement learning, and deep learning. If your results cover only supervised and deep learning, the gaps (unsupervised, reinforcement) become negative space signals worth surfacing.
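A minimal coverage-analysis sketch. The taxonomy dictionary and topic labels are assumed inputs, produced by whatever tagging you already run (classifier, LLM, or metadata):

```python
def coverage_gaps(retrieved_topics, expected_taxonomy):
    """Compare topics covered by retrieved results against an expected taxonomy.

    Returns the sub-concepts with no coverage, i.e. the negative space to flag.
    """
    gaps = {}
    for parent, children in expected_taxonomy.items():
        missing = [c for c in children if c not in retrieved_topics]
        if missing:
            gaps[parent] = missing
    return gaps

# Example using the taxonomy from the paragraph above
taxonomy = {"machine learning": ["supervised learning", "unsupervised learning",
                                 "reinforcement learning", "deep learning"]}
retrieved = {"supervised learning", "deep learning"}
print(coverage_gaps(retrieved, taxonomy))
# {'machine learning': ['unsupervised learning', 'reinforcement learning']}
```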

Another approach is query pattern analysis. Track questions that get poor results or topics where retrieval consistently fails. If users frequently ask about “attention in multimodal models” but your corpus has little content on this, that’s a gap worth flagging. The system can explicitly tell users: “I found limited information on this topic in my knowledge base. This might be an area where coverage is weak.”

The hardest case is structural absences — concepts that should connect but don’t. These require reasoning about the graph structure itself. If you have strong clusters around “transformers” and “reinforcement learning” but no bridge edges between them, that might indicate a gap: shouldn’t there be work on transformer-based RL? This kind of negative space detection is more speculative but potentially very valuable for research planning.

When you surface negative space, you’re changing the nature of the interaction. Instead of “here are the answers,” you’re saying “here’s what I know, and here’s where my knowledge has holes.” This transparency helps users calibrate trust and identifies areas where they might need to search elsewhere or recognize that the field itself hasn’t developed certain connections yet.

The implementation challenge is having domain knowledge to compare against. You need either predefined ontologies/taxonomies for your domain, or user feedback about what’s missing, or analysis of failed query patterns. This makes negative space one of the harder primitives to implement well. But when done right, it shifts retrieval from providing answers to mapping the topology of knowledge — where it’s dense, where it’s sparse, where the uncharted territories lie.


Write-Back: Learning from Reasoning Traces

The Problem: Most retrieval systems are write-only. New information gets added to the corpus, embeddings get computed, the graph grows — but existing knowledge never reorganizes. The graph becomes an accumulation of content, not a living structure that learns from how it’s actually used.

The Solution: When agents discover surprising connections during reasoning, persist those connections back into the knowledge graph. Let the graph evolve based on actual usage patterns.

Write-back changes this by persisting discoveries made during reasoning back into the knowledge graph. When an agent or user finds a surprising connection during exploration — concept A turns out to relate to concept B in ways not obvious from embeddings alone — that discovery is valuable. It reveals something about knowledge structure that the initial ingestion process missed. Write-back means capturing these discoveries and updating the graph accordingly.

The insight is that reasoning traces reveal structural relationships pure similarity doesn’t capture. When an agent successfully solves a problem by connecting concepts A and B, that connection has been validated by use. It’s not just that A and B are semantically similar (they might not be). It’s that A and B are functionally related — bringing them together was productive for a real task. That’s signal worth preserving.

Especially when working with agents, this becomes powerful. The value isn’t just in the paths taken but in the turns not taken — the questions that sparked deeper questions, the moments when understanding shifted. When an agent explores possibility space and makes a connection, that connection reveals something about the structure of knowledge that wasn’t captured in the initial embeddings. Those reasoning traces are gold: they show you where conceptual bridges actually exist versus where they should exist in theory.

Modern graph construction uses LLMs for node creation and edge labeling. Frame semantics provide the structure — rather than just embedding document text, you extract semantic frames: events, participants, relationships, stances. A document about “causal inference” isn’t just a blob of embeddings; it’s a frame with participants (researcher, phenomenon), conditions (controlled vs observational), stances (support vs skeptical). The graph structure captures these frames, and write-back captures how those frames connect during actual reasoning.

When an agent makes a surprising connection, write-back triggers cascades. The new edge might contradict existing beliefs, suggest new bridge connections, strengthen weak edges that proved useful, or identify edges that never get traversed. This is where clustering algorithms become essential for maintenance — they help identify which regions of the graph are still coherent after new edges arrive, and which nodes have drifted so far from their original clusters that they should be flagged for pruning.

The crucial insight for pruning: measure edge traversal over time, not just current similarity. An edge might connect two nodes that aren’t semantically similar right now, but if reasoning consistently traverses that edge, it’s telling you something about the structure of thought that embeddings alone missed. Use clustering to suggest what might be pruned, but let traversal frequency guide the actual decision.

This is essentially Hebbian learning for knowledge graphs: “nodes that fire together wire together.” But crucially, we’re not reinforcing all co-activations — that would just strengthen the densest parts of the graph. Instead, track surprising co-occurrences: nodes that get retrieved together despite low semantic similarity. Two sources:

  1. Agent reasoning patterns: If an agent consistently traverses from concept A to concept B during successful reasoning, strengthen that edge — even if their embedding distance is high. The connection proved functionally useful.

  2. User query patterns: If users frequently ask about two concepts together despite structural distance in the graph, that’s signal they should connect. Users reveal relationships the ingestion process missed.

Co-activation of semantically distant nodes reveals functional relationships that similarity alone cannot capture. The Hebbian mechanism should persist divergent discoveries, not convergent ones.
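A sketch of tracking such co-activations and writing back only the validated, surprising ones. The graph object and its `add_edge` signature are assumptions, and the confidence heuristic is deliberately crude:

```python
from collections import defaultdict

class WriteBackTracker:
    """Track surprising co-activations and persist them as graph edges.

    "Surprising" = retrieved/traversed together despite low embedding
    similarity. Only pairs that keep co-occurring and clear a confidence
    threshold get written back.
    """

    def __init__(self, graph, sim_ceiling=0.5, min_count=3, min_conf=0.7):
        self.graph = graph                 # assumed to expose add_edge(a, b, **attrs)
        self.sim_ceiling = sim_ceiling     # only semantically distant pairs are interesting
        self.min_count = min_count
        self.min_conf = min_conf
        self.cooccur = defaultdict(int)

    def observe(self, node_a, node_b, emb_a, emb_b, outcome_success: bool):
        sim = float(emb_a @ emb_b)         # normalized embeddings assumed
        if sim >= self.sim_ceiling or not outcome_success:
            return                         # not surprising, or not validated by use
        pair = tuple(sorted((node_a, node_b)))
        self.cooccur[pair] += 1
        count = self.cooccur[pair]
        confidence = 1.0 - 1.0 / (1.0 + count)   # crude saturation with repeated use
        if count >= self.min_count and confidence >= self.min_conf:
            self.graph.add_edge(*pair, type="functional",
                                confidence=confidence, evidence_count=count)
```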

Write-back cascades can explode if you’re not careful. If every new edge triggers re-clustering and re-evaluation of all related edges, you’ll spend all your compute on graph maintenance. Batch updates, run clustering asynchronously, and only trigger deep cascades for high-confidence connections. Critical safety mechanisms include requiring confidence thresholds (only connections with >0.7 confidence get persisted), implementing rollback capabilities, and having human review for high-impact structural changes.

When this works, it turns memory systems from passive archives into active learners. The graph doesn’t just accumulate content — it learns from how it’s used, reorganizing itself based on what actually proves valuable during reasoning. This is the difference between a knowledge base that grows endlessly and one that evolves intelligently, becoming more useful over time rather than just larger.

Measuring Success

Without metrics, exploration gets dismissed as expensive wandering. But divergence can and should be measured.

ΔEntropy (Variance Preservation)

What it measures: Are retrieved sets becoming more homogeneous over time?

How to measure:

  • Calculate Shannon entropy over embedding space for retrieved sets
  • Compare diverge mode vs converge mode
  • Track entropy across multiple retrieval iterations

Target: Divergence mode should show >0.3 nats increase vs baseline

What it tells you: Confirms you’re actually surfacing diversity, not just noise. If diverge mode doesn’t increase entropy, your primitives aren’t working.
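A small measurement helper, using the same 2D-projection entropy estimate as the entropy-control sketch earlier; feed it the embeddings of the two retrieved sets for the same query:

```python
import numpy as np

def delta_entropy(diverge_embs: np.ndarray, converge_embs: np.ndarray, bins=8) -> float:
    """ΔEntropy in nats between diverge-mode and converge-mode retrieved sets.

    Target from the section above: > 0.3 nats in favor of diverge mode.
    """
    def h(embs):
        centered = embs - embs.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        coords = centered @ vt[:2].T
        hist, _, _ = np.histogram2d(coords[:, 0], coords[:, 1], bins=bins)
        p = hist.flatten() / max(hist.sum(), 1.0)
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())
    return h(diverge_embs) - h(converge_embs)
```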


Bridge Yield

What it measures: Do cross-cluster retrievals lead to useful results?

How to measure:

  • Track which bridge items users interact with vs ignore
  • Measure click-through rate on bridge vs same-cluster items
  • Monitor follow-up queries that reference bridge concepts

Target: >20% engagement rate on bridge items

What it tells you: Validates that cross-domain connections provide value, not just distraction. Low bridge yield suggests bridges are noise, not insight.


Contradiction Lift

What it measures: Does surfacing disagreement change outcomes?

How to measure:

  • A/B test: same query with/without contradictions surfaced
  • Track whether users engage with contradictory content
  • Measure whether contradictions lead to different conclusions

Target: Contradictions change user’s next action >30% of time

What it tells you: Shows that friction is productive, not just confusing. If contradictions get ignored, they’re not useful.


Time-to-Novelty

What it measures: How quickly does the system surface truly new information?

How to measure:

  • Track iterations until entropy increases vs previous result
  • Measure semantic distance of Nth result from first N-1 results
  • Time until user engages with something semantically distant from initial query

Target: Diverge mode reaches novelty in <3 iterations vs >10 for converge

What it tells you: Efficiency of exploration matters — divergence shouldn’t require exhaustive search to find novelty.


User Acceptance

What it measures: Do users iterate less when given broader initial context?

How to measure:

  • Count edits/prompts with divergence vs without
  • Track session length and query reformulations
  • Measure acceptance rate of generated content

Target: Diverge mode reduces iterations by 30–50%

What it tells you: The ultimate validation — does divergence actually help users accomplish tasks more effectively?


Safety Check: Noise vs Exploration

How to distinguish:

Noise looks like:

  • Low relevance (<0.4) AND no user engagement
  • High entropy but users ignore divergent results
  • Bridge items get clicked but immediately abandoned

Exploration looks like:

  • Medium relevance (0.5–0.7) AND high user engagement
  • Users follow up on divergent results with new queries
  • Bridge items spark connections users didn’t anticipate

Mitigation:

  1. Minimum relevance threshold: Never return a doc with rel(q,d) < 0.4, regardless of $\lambda$
  2. User feedback: Track which divergent results get used vs ignored
  3. Gradual $\lambda$ increase: Start conservative (0.2), increase as users engage with divergent results (see the sketch below)
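A minimal sketch of mechanisms 1 and 3; the thresholds, step size, and bounds are illustrative starting points, not tuned values:

```python
def apply_relevance_floor(results, min_rel=0.4):
    """Drop any result below the relevance floor, regardless of lambda.

    `results`: list of dicts {"id", "rel", ...} from a diverge-mode retrieval.
    """
    return [r for r in results if r["rel"] >= min_rel]

def adjust_lambda(lam, engagement_rate, target=0.2, step=0.05,
                  lam_min=0.1, lam_max=0.6):
    """Raise lambda while users engage with divergent results; back off otherwise.

    `engagement_rate`: fraction of divergent results users interacted with
    over a recent window.
    """
    if engagement_rate > target:
        return min(lam_max, lam + step)
    return max(lam_min, lam - step)
```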

What Divergence Looks Like: Concrete Example

Let’s make this concrete with a realistic comparison.

Scenario: User researching “attention mechanisms in transformers”

STANDARD RAG (convergence-only, $\lambda=0$):

Analysis: High precision (avg relevance: 0.88), zero diversity. Every result explains the same mechanism in different words. User gets a solid understanding of standard attention, but no exposure to alternatives, critiques, or related concepts.


DIVERGENCE-ENABLED ($\lambda=0.3$):

Analysis: Lower average relevance (0.78 vs 0.88) but massively higher reach. User discovers:

  • Biological parallels (computational models inspired by neuroscience)
  • Critiques of standard approach (quadratic complexity problems)
  • Extensions to different domains (graph attention)
  • Recent efficiency improvements (Flash Attention)
  • Awareness of knowledge gaps (multimodal attention not covered)

User’s likely next query: “How do graph attention networks compare to sequence attention?”

This follow-up query wouldn’t have been possible without the divergence-enabled retrieval. The bridge to GATs opened a new line of inquiry the user didn’t know existed.

Addressing Objections

“This sounds expensive”

Yes. Divergence costs more than pure similarity search:

  • MMR: 2–10x slower than top-k (but still fast in absolute terms)
  • Cross-domain bridges: requires clustering infrastructure
  • Contradiction detection: needs stance tracking pipeline
  • Write-back: periodic graph maintenance overhead

But:

  • Total cost dominated by LLM inference (10–100x retrieval cost), not retrieval
  • 10ms → 50ms retrieval latency invisible to users
  • Cost is frontloaded (exploration phase), saves iteration cycles downstream
  • Users iterate less when initial context is richer

The trade-off: Fewer, richer retrievals instead of many narrow ones.

Real-world impact: If divergence reduces user iterations by 30%, you’re doing 30% fewer LLM calls. The retrieval overhead is negligible compared to LLM savings.


“How do I know it’s not just returning noise?”

Three safety mechanisms prevent noise:

  1. Minimum relevance threshold: Never return a doc with rel(q,d) < 0.4, regardless of $\lambda$

    • This ensures even “divergent” results have baseline relevance
    • Prevents random documents from appearing
  2. User feedback loops: Track which divergent results get engaged with vs ignored

    • High engagement = valuable exploration
    • Low engagement = noise or poor timing
    • Adjust $\lambda$ based on engagement patterns
  3. Gradual $\lambda$ increase: Start conservative (0.2), increase as users engage with divergent results