Unstructured.io: Parsing the Unparsable
Dec 30, 2025 • 20 min read
The fastest way to build a bad RAG system is to parse your documents incorrectly. Most developers reach for PyPDF2, pdfplumber, or a simple open(file).read() — and wonder why their AI gives wrong answers. These tools dump raw text without understanding document structure: headers merge with body text, table cells jumble together, multi-column layouts get read left-to-right across both columns, and figures become empty strings. Unstructured.io solves this by running YOLOX computer vision on document images before OCR, giving every text element its semantic category: Title, Header, NarrativeText, Table, ListItem, Image.
1. Why Simple PDF Parsing Fails
- Extracts raw character streams in PDF storage order, ignoring visual layout
- Multi-column: reads straight across both columns, interleaving them
- Tables: flattened to space-separated text → meaningless
- Headers: merged with adjacent body text
- Scanned PDFs: returns empty string (no OCR)
2. The Hi-Res Strategy: Computer Vision Parsing
- Renders each PDF page to an image, runs YOLOX layout detection
- Segments: Title, Header, NarrativeText, Table, Figure
- Tables returned as HTML <table> (relationships preserved)
- Runs Tesseract/PaddleOCR only on detected text blocks
- Works on scanned, photographed, and digital PDFs
pip install "unstructured[pdf,local-inference]" pytesseract
# Tesseract and Poppler are system packages, not pip packages:
# macOS: brew install tesseract poppler
# Debian/Ubuntu: apt-get install tesseract-ocr poppler-utils
from unstructured.partition.pdf import partition_pdf
from unstructured.documents.elements import Table, Title, NarrativeText, ListItem
# Strategy options:
# "auto" - Fast, uses pdfminer for digital PDFs (misses tables on complex layouts)
# "fast" - Even faster, best for simple text-only PDFs
# "hi_res" - Slow but accurate: renders to image, runs YOLOX + Tesseract
# "ocr_only" - Forces OCR even on digital PDFs (for consistency)
elements = partition_pdf(
    filename="annual_report_2024.pdf",
    strategy="hi_res",                    # YOLOX layout detection + Tesseract OCR
    infer_table_structure=True,           # Extract tables as HTML
    extract_image_block_types=["Image"],  # Save embedded images to disk
    extract_image_block_output_dir="./extracted_images/",
    chunking_strategy="by_title",         # Group elements under their nearest header
    max_characters=4000,                  # Hard max chunk size
    new_after_n_chars=3000,               # Soft target: ~3000-char chunks
    combine_text_under_n_chars=200,       # Merge tiny elements (captions, footnotes)
    languages=["eng"],                    # OCR language hint (improves accuracy)
)
# Categorize extracted elements
# (Note: with chunking_strategy set, narrative text is merged into CompositeElement
# chunks; partition without chunking if you need per-element Title/NarrativeText lists.)
tables = [el for el in elements if isinstance(el, Table)]
titles = [el for el in elements if isinstance(el, Title)]
text_chunks = [el for el in elements if isinstance(el, NarrativeText)]
print(f"Found: {len(tables)} tables, {len(titles)} titles, {len(text_chunks)} text blocks")
# Each element has rich metadata:
for el in elements[:3]:
    print(f"Type: {el.category}")
    print(f"Page: {el.metadata.page_number}")
    print(f"Text: {el.text[:100]}...")
    if hasattr(el.metadata, 'text_as_html') and el.metadata.text_as_html:
        print(f"HTML: {el.metadata.text_as_html[:200]}...")

3. The Table Problem: Embedding with Context
# Tables are the hardest document element for RAG
# Flattening a financial table to text destroys rows/columns/relationships
#
# Bad approach:
text = "Revenue 2023 450M 2022 380M Growth 18% Operating Income 2023 95M..."
# → LLM has no idea which numbers belong to which year
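For contrast, here is what the same hypothetical figures look like as the HTML markup that `infer_table_structure=True` stores in `metadata.text_as_html` (the table values are illustrative, matching the flattened string above):

```python
# The same hypothetical table as structured HTML (values are illustrative):
html = (
    "<table>"
    "<tr><th></th><th>2023</th><th>2022</th></tr>"
    "<tr><td>Revenue</td><td>450M</td><td>380M</td></tr>"
    "<tr><td>Operating Income</td><td>95M</td><td>82M</td></tr>"
    "</table>"
)
# Every number now sits in a cell with explicit row and column headers,
# so an LLM can tell that 450M is 2023 revenue.
```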
# Good approach: LLM summarization of structured data
from openai import OpenAI

client = OpenAI()

def process_table_element(table_element) -> dict:
    """Convert table to both a text summary (for retrieval) and HTML (for display)."""
    html = table_element.metadata.text_as_html
    raw_text = table_element.text
    # Step 1: Have an LLM summarize the table in natural language
    summary_response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheap model for summarization
        messages=[{
            "role": "user",
            "content": "Summarize this HTML table in detailed natural language "
                       "sentences, preserving all key numbers and relationships:\n"
                       f"{html}",
        }],
        max_tokens=500,
    )
    summary = summary_response.choices[0].message.content
    return {
        "embed_text": summary,  # Embed the summary for semantic search
        "raw_html": html,       # Store HTML for display/downstream use
        "raw_text": raw_text,   # Fallback text
        "page": table_element.metadata.page_number,
        "source": table_element.metadata.filename,
    }
# Process all tables
table_chunks = []
for table in tables:
    if len(table.text) > 50:  # Skip trivial tables
        table_chunks.append(process_table_element(table))
        print(f"Processed table on page {table.metadata.page_number}")

4. Multi-Modal Parsing: Images and Diagrams
import base64
from pathlib import Path

def caption_extracted_images(image_dir: str, client: OpenAI) -> list[dict]:
    """Generate captions for images extracted from PDFs using GPT-4o vision."""
    captions = []
    for img_path in Path(image_dir).glob("*.jpg"):
        # Read and base64-encode the image for the Vision API
        img_data = base64.b64encode(img_path.read_bytes()).decode()
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{img_data}"},
                    },
                    {
                        "type": "text",
                        "text": "Describe this diagram or figure in detail. If it's a chart, "
                                "extract all data points. If it's a process diagram, describe "
                                "the flow. Be specific and technical.",
                    },
                ],
            }],
            max_tokens=800,
        )
        caption = response.choices[0].message.content
        captions.append({
            "image_path": str(img_path),
            "caption": caption,     # Human-readable caption
            "embed_text": caption,  # What goes into the vector DB
        })
    return captions
image_captions = caption_extracted_images("./extracted_images/", client)
# Build unified RAG chunks from all document elements:
all_chunks = []

# Text chunks (NarrativeText, ListItem, etc.)
for el in elements:
    if hasattr(el, 'text') and len(el.text) > 50:
        all_chunks.append({
            "type": el.category,
            "embed_text": el.text,
            "metadata": {
                "page": getattr(el.metadata, 'page_number', None),
                "filename": getattr(el.metadata, 'filename', 'unknown'),
            },
        })

all_chunks.extend(table_chunks)    # Processed table summaries
all_chunks.extend(image_captions)  # Image captions
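Once the unified chunk list exists, each `embed_text` needs a vector before it can go into a database. A minimal sketch of batched embedding, assuming the `client` created earlier; the model name and batch size of 100 are illustrative choices, not requirements:

```python
def embed_chunks(chunks: list[dict], client,
                 model: str = "text-embedding-3-small") -> list[dict]:
    """Attach an embedding vector to each chunk's embed_text, batching requests."""
    embedded = []
    for i in range(0, len(chunks), 100):  # batch to stay under request-size limits
        batch = chunks[i : i + 100]
        resp = client.embeddings.create(
            model=model, input=[c["embed_text"] for c in batch]
        )
        # The API returns one embedding per input, in order
        for chunk, item in zip(batch, resp.data):
            embedded.append({**chunk, "embedding": item.embedding})
    return embedded

# Usage (illustrative): vectors = embed_chunks(all_chunks, client)
```

Table summaries, image captions, and narrative text all flow through the same field, so one retrieval index covers every modality.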
print(f"Total RAG chunks: {len(all_chunks)}")

5. Parsing Other Document Types
# Unstructured handles 30+ document formats with the same API:
# PowerPoint (preserves slide structure and speaker notes)
from unstructured.partition.pptx import partition_pptx
elements = partition_pptx("presentation.pptx", include_slide_notes=True)
# Word documents (preserves heading hierarchy)
from unstructured.partition.docx import partition_docx
elements = partition_docx("report.docx", include_metadata=True)
# Email (msg, eml formats — preserves From/To/Subject metadata)
from unstructured.partition.email import partition_email
elements = partition_email("message.eml")
for el in elements:
    print(el.metadata.email_message_id, el.category, el.text[:50])
# HTML (preserves semantic structure better than requests + BeautifulSoup)
from unstructured.partition.html import partition_html
elements = partition_html(url="https://example.com/article")
# Auto-detect format
from unstructured.partition.auto import partition
elements = partition("/path/to/unknown_file") # Auto-detects and routes to correct parser
# Unstructured API (cloud-hosted for production, no local install needed)
from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared

s = UnstructuredClient(api_key_auth="YOUR_API_KEY")
with open("document.pdf", "rb") as f:
    response = s.general.partition(request=operations.PartitionRequest(
        partition_parameters=shared.PartitionParameters(
            files=shared.Files(content=f.read(), file_name="document.pdf"),
            strategy=shared.Strategy.HI_RES,
        ),
    ))

Frequently Asked Questions
How slow is the hi_res strategy?
hi_res is 10-30x slower than fast/auto strategies because it renders every page to a high-resolution image and runs YOLOX inference. Expect 5-30 seconds per page for complex documents. For a 200-page annual report, this means 15-60 minutes. This is acceptable for offline document ingestion (run once, store results). For real-time document upload (user uploads and expects immediate results), pre-run ingestion in the background or use the Unstructured cloud API which has GPU-accelerated hi_res processing at ~1-3 seconds per page.
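One practical pattern for the "run once, store results" approach: cache parsed output keyed by the file's content hash, so hi_res never runs twice on the same document. A minimal sketch under stated assumptions (the cache directory, JSON format, and the idea of passing the parser as a callable are all illustrative):

```python
import hashlib
import json
from pathlib import Path
from typing import Callable

def cached_partition(pdf_path: str, parse_fn: Callable[[str], list[dict]],
                     cache_dir: str = ".parse_cache") -> list[dict]:
    """Run an expensive parse once per unique file content; reuse JSON afterwards."""
    Path(cache_dir).mkdir(exist_ok=True)
    # Hash the bytes, not the filename, so renamed copies still hit the cache
    digest = hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.json"
    if cache_file.exists():  # cache hit: skip hi_res entirely
        return json.loads(cache_file.read_text())
    dicts = parse_fn(pdf_path)  # cache miss: pay the hi_res cost once
    cache_file.write_text(json.dumps(dicts))
    return dicts

# Usage (illustrative): wrap the hi_res call from section 2, serializing
# elements with Element.to_dict():
# chunks = cached_partition(
#     "annual_report_2024.pdf",
#     lambda p: [el.to_dict() for el in partition_pdf(filename=p, strategy="hi_res")],
# )
```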
What about scanned PDFs vs digital PDFs?
Digital PDFs (created by Word, LaTeX, or PDF printers) have embedded text that can be extracted with fast/auto strategies. Scanned PDFs (physical document scanned to PDF) have no text — only images — so they require OCR. Unstructured automatically detects this and applies OCR when needed with hi_res. For best OCR accuracy on scanned documents, improve image quality before parsing: 300+ DPI, deskewing, and contrast normalization significantly improve Tesseract performance.
Conclusion
Document parsing quality is often the biggest bottleneck in RAG system accuracy — not the embedding model or the vector database. Unstructured's hi_res strategy with YOLOX layout detection and table extraction reduces document parsing errors by 60-80% compared to simple text extraction for complex documents. For production pipelines processing PDFs, PowerPoint, and Word documents, the investment in proper parsing pays dividends in answer quality. Use table summarization with a cheap LLM to preserve relational structure, and caption images with a vision model to make diagrams retrievable.
Vivek
AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.