Knowledge Graphs: Structured Reasoning
Dec 30, 2025 • 20 min read
Vector databases find similar text. Knowledge graphs find connected concepts. A knowledge graph stores the world not as paragraphs of text but as a network of named entities linked by typed relationships, such as (Apple)-[:ACQUIRED]->(Beats Electronics). This structure enables multi-hop reasoning that vector search alone cannot perform, and it unlocks GraphRAG, an especially powerful form of Retrieval-Augmented Generation for domains with complex, interconnected information.
1. Knowledge Graph Fundamentals
A knowledge graph is built from triples: (Subject, Predicate, Object) — also called (head, relation, tail). Everything in the graph is a triple:
(Elon Musk) -[FOUNDED]-> (SpaceX)
(SpaceX) -[OPERATES]-> (Falcon 9)
(Falcon 9) -[COMPETES_WITH]-> (Ariane 6)
The power is in traversal. A question like "What competes with the rockets operated by companies Elon Musk founded?" requires following three typed hops (FOUNDED → OPERATES → COMPETES_WITH), a path a vector database has no way to walk.
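To make the hop mechanics concrete, here is a toy, in-memory sketch (not a graph database) that walks a chain of relation types over the triples above. The helper names are illustrative, not from any library.

```python
# Toy illustration of multi-hop traversal over a list of triples.
# A real system would use a graph database; this only shows why "hops"
# are a graph-native operation that vector similarity lacks.

triples = [
    ("Elon Musk", "FOUNDED", "SpaceX"),
    ("SpaceX", "OPERATES", "Falcon 9"),
    ("Falcon 9", "COMPETES_WITH", "Ariane 6"),
]

def neighbors(entity: str) -> list[tuple[str, str]]:
    """All (relation, tail) pairs one hop away from entity."""
    return [(rel, tail) for head, rel, tail in triples if head == entity]

def traverse(start: str, path: list[str]) -> list[str]:
    """Follow a chain of relation types, returning the entities reached."""
    frontier = [start]
    for rel_type in path:
        frontier = [
            tail
            for entity in frontier
            for rel, tail in neighbors(entity)
            if rel == rel_type
        ]
    return frontier

# Three hops: companies Musk founded -> their rockets -> competitors
print(traverse("Elon Musk", ["FOUNDED", "OPERATES", "COMPETES_WITH"]))
# ['Ariane 6']
```

Each hop filters on an edge type, which is exactly what a Cypher pattern like `(p:Person)-[:FOUNDED]->()-[:OPERATES]->()-[:COMPETES_WITH]->(x)` does at scale.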
2. Step 1: LLM-Based Triple Extraction
from openai import OpenAI
import json
client = OpenAI()
EXTRACTION_PROMPT = """You are an expert information extraction system.
Extract ALL entity-relationship triples from the text.
Rules:
- Use UPPERCASE_WITH_UNDERSCORES for relation types (e.g., CEO_OF, FOUNDED, ACQUIRED_BY)
- Entity types: Person, Company, Product, Place, Event, Concept, Technology
- Include metadata attributes (year, amount, location) as properties
- Be exhaustive — extract every factual relationship mentioned
Text: {text}
Return JSON:
{{"triples": [
{{
"head": "entity name",
"head_type": "EntityType",
"relation": "RELATION_TYPE",
"tail": "entity name",
"tail_type": "EntityType",
"properties": {{"year": "2002", "amount": "$1.5B"}}
}}
]}}"""
def extract_triples(text: str) -> list[dict]:
response = client.chat.completions.create(
model="gpt-4o",
response_format={"type": "json_object"},
messages=[{
"role": "user",
"content": EXTRACTION_PROMPT.format(text=text)
}],
temperature=0, # Deterministic extraction
)
return json.loads(response.choices[0].message.content)["triples"]
# Example:
text = """In 2002, Elon Musk founded SpaceX with $100 million of his own money
from his PayPal sale. SpaceX is headquartered in Hawthorne, California,
and competes directly with United Launch Alliance, the Boeing and Lockheed Martin joint venture."""
triples = extract_triples(text)
# Returns:
# [{"head": "Elon Musk", "head_type": "Person", "relation": "FOUNDED",
# "tail": "SpaceX", "tail_type": "Company", "properties": {"year": "2002", "amount": "$100M"}},
# {"head": "SpaceX", "head_type": "Company", "relation": "HEADQUARTERED_IN",
# "tail": "Hawthorne, California", "tail_type": "Place"},
# {"head": "SpaceX", "head_type": "Company", "relation": "COMPETES_WITH",
#   "tail": "United Launch Alliance", "tail_type": "Company"}]

3. Step 2: Entity Resolution (The Hard Part)
The LLM might extract "Elon Musk", "Elon", "Mr. Musk", and "Musk" as four different entities from different text passages. Without resolving these variants to a single canonical node, the graph fragments into disconnected islands:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def get_embedding(text: str) -> list[float]:
response = client.embeddings.create(model="text-embedding-3-small", input=text)
return response.data[0].embedding
def resolve_entities(entities: list[str], threshold: float = 0.92) -> dict[str, str]:
"""
Cluster entities by embedding similarity.
Returns mapping: 'Elon' -> 'Elon Musk' (canonical form)
"""
embeddings = np.array([get_embedding(e) for e in entities])
similarity_matrix = cosine_similarity(embeddings)
# Union-Find to group similar entities
parent = {i: i for i in range(len(entities))}
def find(x):
while parent[x] != x:
parent[x] = parent[parent[x]]
x = parent[x]
return x
def union(x, y):
px, py = find(x), find(y)
if px != py:
# Keep the longer name as canonical (usually more specific)
if len(entities[px]) >= len(entities[py]):
parent[py] = px
else:
parent[px] = py
for i in range(len(entities)):
for j in range(i + 1, len(entities)):
if similarity_matrix[i][j] > threshold:
union(i, j)
# Build mapping: variant → canonical
canonical = {entities[i]: entities[find(i)] for i in range(len(entities))}
return canonical
# Usage:
raw_entities = ["Elon Musk", "Elon", "Mr. Musk", "Tesla", "Tesla Inc.", "TSLA"]
resolution_map = resolve_entities(raw_entities)
# {"Elon": "Elon Musk", "Mr. Musk": "Elon Musk", "Tesla Inc.": "Tesla", "TSLA": "Tesla"}

4. Step 3: Storing Triples in Neo4j
from neo4j import GraphDatabase
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
def store_triple(triple: dict, resolution_map: dict):
    """MERGE creates the node or relationship if it doesn't exist, matches it if it does."""
head = resolution_map.get(triple["head"], triple["head"])
tail = resolution_map.get(triple["tail"], triple["tail"])
with driver.session() as session:
session.run(f"""
MERGE (h:{triple['head_type']} {{name: $head}})
MERGE (t:{triple['tail_type']} {{name: $tail}})
MERGE (h)-[r:{triple['relation']}]->(t)
SET r += $props
""",
head=head,
tail=tail,
props=triple.get("properties", {})
)
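```python
# Caveat worth noting here: Cypher cannot parameterize node labels or
# relationship types, which is why store_triple interpolates them with an
# f-string. Because those strings come from an LLM, it is prudent to
# whitelist their characters before building the query. This sanitize
# helper is an illustrative addition, not part of the original pipeline.
import re

def sanitize(identifier: str) -> str:
    """Allow only identifiers shaped like FOUNDED or Company; reject
    anything that could smuggle extra Cypher into the query string."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", identifier):
        raise ValueError(f"Unsafe Cypher identifier: {identifier!r}")
    return identifier

# e.g. f"MERGE (h:{sanitize(triple['head_type'])} {{name: $head}})"
```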
# Pipeline: extract → resolve → store
def process_document(text: str):
# Extract raw triples
raw_triples = extract_triples(text)
# Collect all entities for resolution
all_entities = list(set(
[t["head"] for t in raw_triples] + [t["tail"] for t in raw_triples]
))
# Resolve entity variants to canonical forms
resolution_map = resolve_entities(all_entities)
# Store resolved triples
for triple in raw_triples:
store_triple(triple, resolution_map)
    print(f"Stored {len(raw_triples)} triples from {len(all_entities)} entities")

5. Graph Quality Validation
# Check graph health after ingestion
with driver.session() as session:
# Isolated nodes (no relationships) — usually bad extraction
result = session.run("""
MATCH (n) WHERE NOT (n)--() RETURN count(n) as isolated
""")
print(f"Isolated nodes: {result.single()['isolated']}") # Should be < 5%
# Most connected nodes (your graph's hubs)
result = session.run("""
MATCH (n)-[r]-()
RETURN n.name, count(r) as degree
ORDER BY degree DESC LIMIT 10
""")
for record in result:
print(f"{record['n.name']}: {record['degree']} connections")
# Relationship type distribution
result = session.run("""
MATCH ()-[r]->()
RETURN type(r) as rel_type, count(*) as count
ORDER BY count DESC
""")
for record in result:
        print(f"{record['rel_type']}: {record['count']}")

6. LangChain LLMGraphTransformer (Automated Pipeline)
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI
from langchain_community.graphs import Neo4jGraph
from langchain_core.documents import Document
# LangChain wraps the entire pipeline
llm = ChatOpenAI(model="gpt-4o", temperature=0)
graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="password")
transformer = LLMGraphTransformer(
llm=llm,
allowed_nodes=["Person", "Company", "Product", "Place"], # Optional: restrict types
allowed_relationships=["FOUNDED", "CEO_OF", "ACQUIRED", "COMPETES_WITH"],
)
docs = [Document(page_content="Elon Musk founded SpaceX in 2002...")]
graph_docs = transformer.convert_to_graph_documents(docs)
graph.add_graph_documents(graph_docs, baseEntityLabel=True, include_source=True)

Frequently Asked Questions
How do I handle extraction errors and hallucinations?
Use temperature=0 for deterministic extraction. Implement a validation step: for any triple where you're uncertain, require the LLM to quote the exact sentence that supports the triple. If it can't quote a supporting sentence, discard the triple. This dramatically reduces hallucinated relationships.
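The quote check described above can be implemented as a pure post-filter. A minimal sketch, assuming the extraction prompt is amended to return a "supported" flag and a "quote" field alongside each triple (that JSON shape is an assumption, not part of the original prompt):

```python
import json

def quote_supports(model_reply: str, source_text: str) -> bool:
    """Keep a triple only if the model's quoted evidence appears verbatim
    in the source text; a hallucinated justification fails this check."""
    result = json.loads(model_reply)
    quote = result.get("quote", "")
    # Require a non-empty quote AND an exact substring match
    return bool(result.get("supported")) and bool(quote) and quote in source_text

reply = '{"supported": true, "quote": "Elon Musk founded SpaceX"}'
text = "In 2002, Elon Musk founded SpaceX with $100 million."
print(quote_supports(reply, text))               # True
print(quote_supports(reply, "Unrelated text."))  # False
```

The substring check is the point: the model cannot fabricate supporting evidence without the mismatch being caught mechanically.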
How many documents can I process before the graph becomes unusable?
Neo4j Community Edition handles hundreds of millions of nodes and relationships efficiently. The bottleneck is usually LLM API cost for extraction (~$0.01 per document for GPT-4o) and entity resolution compute. For large document collections (>10,000 docs), use a fine-tuned smaller model (Llama 3.1 8B) for extraction to reduce costs by 10-50x.
Conclusion
Knowledge graph construction transforms unstructured text into a navigable network of facts, enabling reasoning capabilities that are impossible with vector search alone. The pipeline — LLM extraction → entity resolution → graph storage → quality validation — is straightforward to implement with the code above. For complex domains like legal research, biomedical literature, and financial analysis, the structured relationships in a knowledge graph are the difference between an agent that approximates and one that reasons precisely.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.