Graph RAG: Knowledge Graphs with Neo4j
Dec 29, 2025 • 22 min read
Vector databases answer "What is similar?" with remarkable precision. They cannot answer "How is this connected to that?", a fundamentally different kind of question that requires traversing explicit relationships. GraphRAG combines the embedding-based retrieval of vector databases with the relationship-traversal power of graph databases, enabling AI agents to answer multi-hop questions that vector search alone would stumble over or hallucinate through.
1. The Limitation of Pure Vector RAG
Consider this question: "Which investors funded companies that later competed with the startup founded by the person who wrote the AlphaGo paper?"
A vector DB retrieves chunks based on semantic similarity to this query. But answering the question requires traversing a path through the graph: Person → wrote → Paper → led_to → Startup → competed_with → Companies → funded_by → Investors. Without explicit relationship storage, an LLM has to hallucinate this chain from its training data, which is unreliable.
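The contrast is easy to see with a toy in-memory graph (all entity names below are invented for illustration): once edges are stored explicitly, following the hop chain is a mechanical lookup rather than a guess.

```python
# Toy graph with explicit, typed edges. Every entity here is hypothetical.
edges = {
    "PersonA":  [("WROTE", "PaperX")],
    "PaperX":   [("LED_TO", "StartupY")],
    "StartupY": [("COMPETED_WITH", "CompanyZ")],
    "CompanyZ": [("FUNDED_BY", "InvestorQ")],
}

def follow(start: str, relations: list[str]) -> list[str]:
    """Walk the relation chain hop by hop, keeping only matching targets."""
    frontier = [start]
    for rel in relations:
        frontier = [dst for node in frontier
                    for (r, dst) in edges.get(node, []) if r == rel]
    return frontier

print(follow("PersonA", ["WROTE", "LED_TO", "COMPETED_WITH", "FUNDED_BY"]))
# -> ['InvestorQ']
```

A vector index has no equivalent of `follow`: it can rank chunks by similarity to the question, but it cannot chain four typed edges together.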
2. Neo4j Basics: Nodes, Relationships, and Cypher
# Install Neo4j locally (Docker is the fastest way).
# Port 7474 serves the browser UI; 7687 is the Bolt protocol.
docker run -d \
  --name neo4j \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/yourpassword \
  neo4j:5.15.0

# Access Neo4j Browser at: http://localhost:7474
// Cypher basics: the query language for Neo4j
// (Cypher comments start with //, not #)

// Create nodes:
CREATE (elon:Person {name: "Elon Musk", born: 1971})
CREATE (tesla:Company {name: "Tesla", founded: 2003})
CREATE (spacex:Company {name: "SpaceX", founded: 2002})

// Create relationships (in the same query, so the variables stay in scope):
CREATE (elon)-[:CEO_OF]->(tesla)
CREATE (elon)-[:FOUNDER_OF]->(spacex)
CREATE (tesla)-[:COMPETES_WITH]->(rivian:Company {name: "Rivian"})

// Multi-hop query (the power of graphs):
MATCH (p:Person)-[:CEO_OF|FOUNDER_OF]->(c:Company)-[:COMPETES_WITH]->(competitor:Company)
WHERE p.name = "Elon Musk"
RETURN p.name, c.name, competitor.name
// Output: Elon Musk | Tesla | Rivian
// (Elon Musk | SpaceX | Blue Origin would also appear if that edge were in the graph)

3. Step 1: Entity & Relationship Extraction from Text
The first step in building a knowledge graph is extracting structured triples (subject → relation → object) from unstructured text using an LLM:
from openai import OpenAI
import json
from neo4j import GraphDatabase

client = OpenAI()
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def extract_triples(text: str) -> list[dict]:
    """Use an LLM to extract entity-relationship triples from text."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": f"""Extract all entity-relationship triples from this text.
Text: "{text}"
Return JSON: {{"triples": [
  {{"subject": "entity_name", "subject_type": "Person|Company|Product|Place|Event",
    "relation": "RELATIONSHIP_TYPE_UPPERCASE",
    "object": "entity_name", "object_type": "Person|Company|Product|Place|Event"}}
]}}
Be specific with relation types: FOUNDED, ACQUIRED, CEO_OF, INVESTED_IN, COMPETES_WITH, etc."""
        }]
    )
    return json.loads(response.choices[0].message.content)["triples"]
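LLM extraction routinely emits the same triple more than once when source chunks overlap. It is worth collapsing duplicates before writing to the graph; a small helper (not part of the pipeline above, added here as a sketch) might look like:

```python
def dedupe_triples(triples: list[dict]) -> list[dict]:
    """Collapse duplicate triples, comparing entity names case-insensitively."""
    seen, unique = set(), []
    for t in triples:
        key = (t["subject"].lower(), t["relation"], t["object"].lower())
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique
```

MERGE already prevents duplicate nodes in Neo4j, but deduplicating first avoids redundant round trips and keeps casing variants ("Tesla" vs "tesla") from slipping through as separate writes.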
import re

def sanitize(label: str) -> str:
    """Labels and relationship types cannot be Cypher query parameters,
    so whitelist their characters before interpolating into the query."""
    return re.sub(r"[^A-Za-z0-9_]", "", label)

def store_triples_in_neo4j(triples: list[dict]):
    """MERGE ensures no duplicate nodes or relationships are created."""
    with driver.session() as session:
        for triple in triples:
            query = (
                f"MERGE (s:{sanitize(triple['subject_type'])} {{name: $subject_name}}) "
                f"MERGE (o:{sanitize(triple['object_type'])} {{name: $object_name}}) "
                f"MERGE (s)-[:{sanitize(triple['relation'])}]->(o)"
            )
            session.run(query, subject_name=triple["subject"], object_name=triple["object"])

4. Step 2: GraphRAG Retrieval
When a user asks a question, generate a Cypher query to retrieve the relevant subgraph, then pass it as context to the LLM:
def text_to_cypher(question: str, schema: str) -> str:
    """Use an LLM to translate a natural-language question into Cypher."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": f"""You are an expert at writing Neo4j Cypher queries.
Graph schema: {schema}
Convert the user's question to a Cypher query.
Return ONLY the Cypher query, no explanation."""
        }, {
            "role": "user",
            "content": question
        }]
    )
    return response.choices[0].message.content.strip()
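One practical wrinkle: even when told to return only the query, models sometimes wrap it in a markdown code fence. A small cleanup step before execution (a defensive sketch, not part of the pipeline above) avoids syntax errors from stray backticks:

```python
import re

def clean_cypher(raw: str) -> str:
    """Strip a markdown code fence the LLM may wrap around the query."""
    raw = raw.strip()
    raw = re.sub(r"^```(?:cypher)?\s*", "", raw)
    raw = re.sub(r"\s*```$", "", raw)
    return raw.strip()
```

Passing `text_to_cypher` output through `clean_cypher` before `session.run` makes the pipeline tolerant of fenced responses without changing well-behaved ones.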
def graph_rag_query(question: str) -> str:
    # Step 1: Generate Cypher
    schema = "(:Person)-[:FOUNDER_OF]->(:Company), (:Company)-[:ACQUIRED]->(:Company)..."
    cypher = text_to_cypher(question, schema)

    # Step 2: Execute against Neo4j
    with driver.session() as session:
        result = session.run(cypher)
        subgraph_data = [dict(record) for record in result]

    # Step 3: Pass the subgraph to the LLM as context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Answer the question using only the graph data provided."
        }, {
            "role": "user",
            "content": f"Graph data: {json.dumps(subgraph_data)}\n\nQuestion: {question}"
        }]
    )
    return response.choices[0].message.content

5. LangChain GraphCypherQAChain
LangChain provides a pre-built GraphRAG chain that handles schema extraction, Cypher generation, and answer synthesis:
from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain
from langchain_openai import ChatOpenAI
# Connect to Neo4j
graph = Neo4jGraph(
url="bolt://localhost:7687",
username="neo4j",
password="password",
)
# Auto-refresh schema from the actual graph
graph.refresh_schema()
print(graph.schema) # Shows all node labels and relationship types
llm = ChatOpenAI(model="gpt-4o", temperature=0)
# Build the chain: handles text → Cypher → results → answer automatically
chain = GraphCypherQAChain.from_llm(
    llm=llm,
    graph=graph,
    verbose=True,                   # Shows generated Cypher queries for debugging
    validate_cypher=True,           # Validates/corrects relationship directions in generated Cypher
    return_intermediate_steps=True, # Returns the Cypher query used
    allow_dangerous_requests=True,  # Required in recent versions: acknowledges that
                                    # LLM-generated queries run against your database
)
result = chain.invoke({"query": "Who invested in companies founded by Stanford alumni?"})
print(result["result"]) # Natural language answer
print(result["intermediate_steps"]) # The Cypher query that was generated

6. Hybrid RAG: Vectors + Graphs Together
The most powerful pattern combines both: use vector search for semantic similarity (finding relevant documents) and graph traversal for relationship reasoning (finding connected entities):
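Before looking at the Cypher, the two-stage shape of hybrid retrieval can be sketched in plain Python. The `vector_search` and `expand_entities` callables below are stand-ins for the vector index lookup and the graph traversal, not real APIs:

```python
def hybrid_retrieve(query_embedding, vector_search, expand_entities, k=5):
    """Stage 1: top-k chunks by semantic similarity.
    Stage 2: expand each chunk's mentioned entities via the graph."""
    context = []
    for chunk in vector_search(query_embedding, k):
        context.append({
            "text": chunk["text"],
            "related": expand_entities(chunk["id"]),  # graph neighborhood
        })
    return context
```

The Cypher version below fuses both stages into a single query because Neo4j can host the vector index and the relationships side by side.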
// Neo4j supports native vector indexes (since v5.11), so chunk
// embeddings AND entity relationships live in the same database.

// 1. Create a vector index on Chunk nodes
CREATE VECTOR INDEX chunk_embeddings IF NOT EXISTS
FOR (c:Chunk) ON (c.embedding)
OPTIONS {indexConfig: {
  `vector.dimensions`: 1536,
  `vector.similarity_function`: 'cosine'
}}
// 2. Hybrid retrieval: vector search + graph expansion
CALL db.index.vector.queryNodes('chunk_embeddings', 5, $query_embedding)
YIELD node AS chunk, score
WHERE score > 0.8
MATCH (chunk)-[:MENTIONS]->(entity:Entity)
MATCH (entity)-[rel]-(related:Entity)
RETURN chunk.text, entity.name, type(rel), related.name
ORDER BY score DESC
LIMIT 20

Frequently Asked Questions
When should I use GraphRAG vs pure vector RAG?
Use GraphRAG when your domain naturally involves networks of relationships: legal (case citations, precedents), medical (drug interactions, patient histories), finance (company ownership chains), and technical documentation (API dependencies). For siloed document Q&A where questions don't span multiple entities, standard vector RAG is simpler and sufficient.
How do I handle incorrect Cypher queries from the LLM?
Use validate_cypher=True in LangChain to catch syntax errors before execution. Wrap Cypher execution in try/except and fall back to semantic vector search if the graph query fails. Log all generated Cypher queries to LangSmith for debugging and iterating on your schema documentation.
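The try/except fallback described above can be sketched as a thin wrapper. All three function arguments are placeholders for your own Cypher generator, graph executor, and vector retriever:

```python
def answer_with_fallback(question, generate_cypher, run_cypher, vector_search):
    """Prefer graph retrieval; fall back to vector search on any failure."""
    try:
        cypher = generate_cypher(question)
        return {"source": "graph", "data": run_cypher(cypher)}
    except Exception as exc:
        # In production, log the question + failed Cypher (e.g. to LangSmith)
        # before falling back, so schema documentation can be iterated on.
        return {"source": "vector", "data": vector_search(question), "error": str(exc)}
```

Tagging each answer with its `source` also makes it easy to measure how often the graph path actually succeeds.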
Conclusion
GraphRAG shifts AI retrieval from "find similar documents" to "traverse relationships between facts." In domains where entities are connected by complex, multi-hop relationships (legal research, pharmaceutical knowledge bases, enterprise organizational data), pairing Neo4j with vector search provides dramatically higher accuracy than either approach alone. The LangChain integration makes the implementation accessible in under 50 lines of application code.
Vivek
AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning: no fluff, just working code and real-world context.