opncrafter

Knowledge Graphs: Structured Reasoning

Dec 30, 2025 • 20 min read

Vector databases find similar text. Knowledge graphs find connected concepts. A knowledge graph stores the world not as paragraphs of text but as a network of named entities linked by typed relationships — (Apple)-[:ACQUIRED]->(Beats Electronics). This structure enables multi-hop reasoning that vector search cannot perform, and it unlocks GraphRAG: one of the most powerful forms of Retrieval-Augmented Generation for domains with complex, interconnected information.

1. Knowledge Graph Fundamentals

A knowledge graph is built from triples: (Subject, Predicate, Object) — also called (head, relation, tail). Everything in the graph is a triple:

  • (Elon Musk) -[FOUNDED]-> (SpaceX)
  • (SpaceX) -[OPERATES]-> (Falcon 9)
  • (Falcon 9) -[COMPETES_WITH]-> (Ariane 6)

The power is in traversal. Answering "Which rockets compete with the ones operated by companies Elon Musk founded?" requires three hops (FOUNDED, then OPERATES, then COMPETES_WITH), a chain that a vector database cannot follow: similarity search retrieves passages, it does not walk relationships.
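That traversal can be sketched in plain Python before any database enters the picture: triples as tuples, and a `hop` helper that follows one relation type at a time (the helper is illustrative, not part of any library):

```python
# Triples from the example above, stored as (head, relation, tail) tuples.
TRIPLES = [
    ("Elon Musk", "FOUNDED", "SpaceX"),
    ("SpaceX", "OPERATES", "Falcon 9"),
    ("Falcon 9", "COMPETES_WITH", "Ariane 6"),
]

def hop(entities: set[str], relation: str) -> set[str]:
    """Follow one relation type outward from a set of entities."""
    return {t for h, r, t in TRIPLES if h in entities and r == relation}

# Three hops: FOUNDED -> OPERATES -> COMPETES_WITH
companies = hop({"Elon Musk"}, "FOUNDED")      # {'SpaceX'}
vehicles = hop(companies, "OPERATES")          # {'Falcon 9'}
competitors = hop(vehicles, "COMPETES_WITH")   # {'Ariane 6'}
print(competitors)
```

Each hop is a set intersection over typed edges; a graph database like Neo4j does exactly this, just with indexes and a query language on top.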

2. Step 1: LLM-Based Triple Extraction

from openai import OpenAI
import json

client = OpenAI()

EXTRACTION_PROMPT = """You are an expert information extraction system.
Extract ALL entity-relationship triples from the text.

Rules:
- Use UPPERCASE_WITH_UNDERSCORES for relation types (e.g., CEO_OF, FOUNDED, ACQUIRED_BY)
- Entity types: Person, Company, Product, Place, Event, Concept, Technology
- Include metadata attributes (year, amount, location) as properties
- Be exhaustive — extract every factual relationship mentioned

Text: {text}

Return JSON:
{{"triples": [
    {{
        "head": "entity name",
        "head_type": "EntityType",
        "relation": "RELATION_TYPE",
        "tail": "entity name",
        "tail_type": "EntityType",
        "properties": {{"year": "2002", "amount": "$1.5B"}}
    }}
]}}"""

def extract_triples(text: str) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": EXTRACTION_PROMPT.format(text=text)
        }],
        temperature=0,  # Deterministic extraction
    )
    return json.loads(response.choices[0].message.content)["triples"]

# Example:
text = """In 2002, Elon Musk founded SpaceX with $100 million of his own money 
from his PayPal sale. SpaceX is headquartered in Hawthorne, California, 
and competes directly with Boeing's United Launch Alliance."""

triples = extract_triples(text)
# Returns:
# [{"head": "Elon Musk", "head_type": "Person", "relation": "FOUNDED",
#   "tail": "SpaceX", "tail_type": "Company", "properties": {"year": "2002", "amount": "$100M"}},
#  {"head": "SpaceX", "head_type": "Company", "relation": "HEADQUARTERED_IN",
#   "tail": "Hawthorne, California", "tail_type": "Place"},
#  {"head": "SpaceX", "head_type": "Company", "relation": "COMPETES_WITH",
#   "tail": "United Launch Alliance", "tail_type": "Company"}]
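One practical note: the prompt above operates on a single passage, so long documents need to be split before extraction. A minimal chunker with overlap (the window and overlap sizes here are arbitrary defaults, not tuned values):

```python
def chunk_text(text: str, max_chars: int = 4000, overlap: int = 400) -> list[str]:
    """Split text into overlapping windows so a fact that straddles a
    chunk boundary still appears intact in at least one chunk."""
    chunks, start, step = [], 0, max_chars - overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += step
    return chunks

# Each chunk is passed to extract_triples() independently;
# entity resolution later stitches the per-chunk results back together.
```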

3. Step 2: Entity Resolution (The Hard Part)

The LLM might extract "Elon Musk", "Elon", "Mr. Musk", and "Musk" as four different entities from different text passages. Without resolving these to a single canonical node, your graph fragments into a disconnected mess of duplicates:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

def resolve_entities(entities: list[str], threshold: float = 0.92) -> dict[str, str]:
    """
    Cluster entities by embedding similarity.
    Returns mapping: 'Elon' -> 'Elon Musk' (canonical form)
    """
    embeddings = np.array([get_embedding(e) for e in entities])
    similarity_matrix = cosine_similarity(embeddings)
    
    # Union-Find to group similar entities
    parent = {i: i for i in range(len(entities))}
    
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    
    def union(x, y):
        px, py = find(x), find(y)
        if px != py:
            # Keep the longer name as canonical (usually more specific)
            if len(entities[px]) >= len(entities[py]):
                parent[py] = px
            else:
                parent[px] = py
    
    for i in range(len(entities)):
        for j in range(i + 1, len(entities)):
            if similarity_matrix[i][j] > threshold:
                union(i, j)
    
    # Build mapping: variant → canonical
    canonical = {entities[i]: entities[find(i)] for i in range(len(entities))}
    return canonical

# Usage:
raw_entities = ["Elon Musk", "Elon", "Mr. Musk", "Tesla", "Tesla Inc.", "TSLA"]
resolution_map = resolve_entities(raw_entities)
# Longer names win as canonical, so the mapping looks like:
# {"Elon": "Elon Musk", "Mr. Musk": "Elon Musk", "Tesla": "Tesla Inc."}
# ("TSLA" may score below the 0.92 threshold and remain a separate entity)

4. Step 3: Storing Triples in Neo4j

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def store_triple(triple: dict, resolution_map: dict):
    """MERGE matches an existing node or relationship, or creates it if missing."""
    head = resolution_map.get(triple["head"], triple["head"])
    tail = resolution_map.get(triple["tail"], triple["tail"])
    
    with driver.session() as session:
        session.run(f"""
            MERGE (h:{triple['head_type']} {{name: $head}})
            MERGE (t:{triple['tail_type']} {{name: $tail}})
            MERGE (h)-[r:{triple['relation']}]->(t)
            SET r += $props
            """,
            head=head,
            tail=tail,
            props=triple.get("properties", {})
        )

# Pipeline: extract → resolve → store
def process_document(text: str):
    # Extract raw triples
    raw_triples = extract_triples(text)
    
    # Collect all entities for resolution
    all_entities = list(set(
        [t["head"] for t in raw_triples] + [t["tail"] for t in raw_triples]
    ))
    
    # Resolve entity variants to canonical forms
    resolution_map = resolve_entities(all_entities)
    
    # Store resolved triples
    for triple in raw_triples:
        store_triple(triple, resolution_map)
    
    print(f"Stored {len(raw_triples)} triples from {len(all_entities)} entities")
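One caveat about store_triple: Cypher cannot parameterize labels or relationship types, which is why they are interpolated into the query via an f-string. Since the LLM produces those strings, it's worth whitelisting them before interpolation. A minimal sketch (the regex and helper name are my own, not a Neo4j API):

```python
import re

# Plain Cypher identifiers: letters, digits, underscores, no leading digit.
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def safe_identifier(s: str) -> str:
    """Reject any LLM-produced label or relation type that isn't a plain
    identifier, preventing Cypher injection via string interpolation."""
    if not IDENT_RE.fullmatch(s):
        raise ValueError(f"unsafe Cypher identifier: {s!r}")
    return s
```

Call `safe_identifier(triple["head_type"])` (and likewise for `tail_type` and `relation`) before building the MERGE query.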

5. Graph Quality Validation

# Check graph health after ingestion
with driver.session() as session:
    
    # Isolated nodes (no relationships) — usually bad extraction
    result = session.run("""
        MATCH (n) WHERE NOT (n)--() RETURN count(n) as isolated
    """)
    print(f"Isolated nodes: {result.single()['isolated']}")  # Should be well under 5% of total nodes
    
    # Most connected nodes (your graph's hubs)
    result = session.run("""
        MATCH (n)-[r]-()
        RETURN n.name, count(r) as degree
        ORDER BY degree DESC LIMIT 10
    """)
    for record in result:
        print(f"{record['n.name']}: {record['degree']} connections")
    
    # Relationship type distribution
    result = session.run("""
        MATCH ()-[r]->()
        RETURN type(r) as rel_type, count(*) as count
        ORDER BY count DESC
    """)
    for record in result:
        print(f"{record['rel_type']}: {record['count']}")

6. LangChain LLMGraphTransformer (Automated Pipeline)

from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI
from langchain_community.graphs import Neo4jGraph
from langchain_core.documents import Document

# LangChain wraps the entire pipeline
llm = ChatOpenAI(model="gpt-4o", temperature=0)
graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="password")

transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Person", "Company", "Product", "Place"],  # Optional: restrict types
    allowed_relationships=["FOUNDED", "CEO_OF", "ACQUIRED", "COMPETES_WITH"],
)

docs = [Document(page_content="Elon Musk founded SpaceX in 2002...")]
graph_docs = transformer.convert_to_graph_documents(docs)
graph.add_graph_documents(graph_docs, baseEntityLabel=True, include_source=True)

Frequently Asked Questions

How do I handle extraction errors and hallucinations?

Use temperature=0 for near-deterministic extraction (sampling at temperature 0 still isn't guaranteed to be bit-identical across runs, but it removes most variance). Then add a validation step: for any triple you're uncertain about, require the LLM to quote the exact sentence that supports it. If it can't produce a supporting quote, discard the triple. This dramatically reduces hallucinated relationships.
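The quoting check itself can be a pure string comparison: prompt the model to return either a verbatim quote or a sentinel such as NO_SUPPORT (the sentinel name is illustrative), then verify the quote actually appears in the source text. A quote the model invented will fail this check:

```python
def is_supported(quote: str, source_text: str) -> bool:
    """Accept a triple only if the model's cited evidence appears
    verbatim (ignoring whitespace and case) in the source document."""
    normalize = lambda s: " ".join(s.split()).lower()
    return quote.strip() != "NO_SUPPORT" and normalize(quote) in normalize(source_text)

doc = "In 2002, Elon Musk founded SpaceX with $100 million."
print(is_supported("Elon Musk founded SpaceX", doc))    # True: verbatim match
print(is_supported("Musk sold SpaceX in 2010", doc))    # False: fabricated quote
print(is_supported("NO_SUPPORT", doc))                  # False: model found no evidence
```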

How many documents can I process before the graph becomes unusable?

Neo4j Community Edition handles hundreds of millions of nodes and relationships efficiently. The bottleneck is usually LLM API cost for extraction (roughly $0.01 per document with GPT-4o, depending on document length) and entity-resolution compute. For large collections (over 10,000 documents), a fine-tuned smaller model such as Llama 3.1 8B can handle extraction and cut costs by 10-50x.
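The per-document figure above is easy to sanity-check with back-of-envelope arithmetic. All numbers below are assumptions — average token counts and per-token prices vary by model and document, so substitute current rates:

```python
def extraction_cost(n_docs: int,
                    input_tokens: int = 1500,          # assumed avg prompt + document
                    output_tokens: int = 500,          # assumed avg JSON triples output
                    price_in_per_1k: float = 0.0025,   # assumed $ per 1K input tokens
                    price_out_per_1k: float = 0.01) -> float:  # assumed $ per 1K output tokens
    """Rough total extraction cost in dollars for a document collection."""
    per_doc = (input_tokens / 1000 * price_in_per_1k
               + output_tokens / 1000 * price_out_per_1k)
    return n_docs * per_doc

print(f"10K docs: ${extraction_cost(10_000):,.2f}")  # about $0.009 per document
```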

Conclusion

Knowledge graph construction transforms unstructured text into a navigable network of facts, enabling reasoning capabilities that are impossible with vector search alone. The pipeline — LLM extraction → entity resolution → graph storage → quality validation — is straightforward to implement with the code above. For complex domains like legal research, biomedical literature, and financial analysis, the structured relationships in a knowledge graph are the difference between an agent that approximates and one that reasons precisely.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.
