opncrafter
Data Engineering

The data pipelines that feed better AI systems.

Every AI system is only as good as the data behind it. Building LLM applications requires data engineering skills that most ML tutorials skip: ingesting messy documents from PDFs and PowerPoints, cleaning and deduplicating training data at scale, generating synthetic datasets to fine-tune smaller models, and versioning datasets so experiments are reproducible.
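
Deduplication at scale usually means catching near-duplicates, not just exact copies. As a rough illustration, here is a minimal pure-Python sketch of MinHash, the technique named in this track's data cleaning guide. The function names, the 3-word shingle size, and the 64 hash functions are assumptions chosen for the example, not taken from any guide or library.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """For each of num_perm salted hash functions, keep the minimum hash value."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = "The quarterly report shows revenue grew by ten percent year over year"
    doc_b = "The quarterly report shows revenue grew by ten percent compared to last year"
    sig_a = minhash_signature(shingles(doc_a))
    sig_b = minhash_signature(shingles(doc_b))
    print(f"estimated similarity: {estimated_jaccard(sig_a, sig_b):.2f}")
```

In a real pipeline you would compare signatures through locality-sensitive hashing buckets instead of pairwise, and pick a similarity threshold by inspecting samples of your own data.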

Unstructured.io is a widely used tool for parsing complex document formats for RAG: it handles PDFs with tables, images, and mixed formatting that break simpler parsers. DVC (Data Version Control) brings Git-style versioning to large datasets. Argilla provides a human annotation layer for building high-quality RLHF training data. For RAG specifically, how you chunk your documents strongly affects retrieval accuracy, so this track covers recursive, semantic, and agentic chunking strategies.
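
To make the recursive strategy concrete, here is a minimal sketch under stated assumptions: it splits on the coarsest separator first (paragraphs), and only falls back to finer separators (lines, sentences, words) for pieces that are still too long. The function name `recursive_chunk`, the separator order, and the 200-character limit are illustrative choices, not the API of any particular library.

```python
def recursive_chunk(
    text: str,
    max_chars: int = 200,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split on the coarsest separator first, recursing to finer ones for oversized pieces."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard character cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)  # current chunk is full, flush it
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_chunk(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "First paragraph about ingestion.\n\nSecond paragraph about cleaning. " * 5
    for i, chunk in enumerate(recursive_chunk(sample, max_chars=80)):
        print(i, len(chunk), repr(chunk))
```

This version has no overlap between chunks and measures length in characters; production splitters typically add overlap and count tokens instead, and semantic or agentic chunking replace the fixed separator list with embedding similarity or model-driven boundaries.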

If you're building a production RAG system or fine-tuning a model for a specific domain, half the work is in the data. This track covers the tools and techniques that make your data pipeline reliable, reproducible, and high-quality.

📚 Learning Path

  1. Unstructured.io: parsing PDFs, HTML, and PPTX
  2. Data Version Control (DVC) for reproducibility
  3. Advanced chunking: recursive and semantic strategies
  4. Synthetic data generation with GPT-4
  5. Argilla for RLHF annotation and DPO data

11 Guides in This Track

Synthetic Data Pipelines

Using GPT-4 to generate training data for small models.

Unstructured.io ETL

Parsing PDFs, PPTX, and HTML for RAG.

Argilla for RLHF

Data labeling for DPO and fine-tuning.

Feature Stores for RAG

Using Feast for real-time personalization.

Data Cleaning & Dedupe

MinHash and PII scrubbing at scale.

Knowledge Graph Construction

Automating graph builds with LLMs.

Data Version Control

Git for large datasets.

HNSW Indexing

The math behind vector search.

Advanced Chunking

Recursive vs Semantic splitting.

AI Governance

EU AI Act compliance.

Multi-Modal Datasets

Storing Image+Text pairs.
