Data Engineering
The data pipelines that feed better AI systems.
Every AI system is only as good as the data behind it. Building LLM applications requires data engineering skills that most ML tutorials skip: ingesting messy documents from PDFs and PowerPoints, cleaning and deduplicating training data at scale, generating synthetic datasets to fine-tune smaller models, and versioning datasets so experiments are reproducible.
Unstructured.io has become a de facto standard for parsing complex document formats in RAG pipelines — handling PDFs with tables, images, and mixed formatting that break simpler parsers. DVC (Data Version Control) brings Git-style versioning to large datasets. Argilla provides a human annotation layer for building high-quality RLHF training data. For RAG specifically, how you chunk your documents has an enormous impact on retrieval accuracy — this track covers recursive, semantic, and agentic chunking strategies.
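To make the chunking point concrete, here is a minimal sketch of recursive chunking in plain Python — no libraries, and `max_len` is measured in characters for simplicity (production pipelines typically count tokens instead). The separator hierarchy and the helper name are illustrative, not a specific library's API:

```python
def recursive_chunk(text, max_len=200, separators=("\n\n", "\n", " ")):
    """Split text on the coarsest separator that works, recursing on
    any piece that is still too long. Illustrative sketch only."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep not in text:
            continue  # try the next, finer-grained separator
        parts = text.split(sep)
        chunks, current = [], ""
        for part in parts:
            candidate = part if not current else current + sep + part
            if len(candidate) <= max_len:
                current = candidate  # greedily merge small parts
            else:
                if current:
                    chunks.append(current)
                current = part
        if current:
            chunks.append(current)
        # a single part may still exceed max_len; recurse on those
        out = []
        for chunk in chunks:
            out.extend(recursive_chunk(chunk, max_len, separators))
        return out
    # no separator present at all: hard split by character count
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

The key design idea — and what distinguishes recursive from naive fixed-size chunking — is that paragraph boundaries are preferred over sentence or word boundaries, so chunks tend to stay semantically coherent.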
If you're building a production RAG system or fine-tuning a model for a specific domain, half the work is in the data. This track covers the tools and techniques that make your data pipeline reliable, reproducible, and high-quality.
📚 Learning Path
- Unstructured.io: parsing PDFs, HTML, and PPTX
- Data Version Control (DVC) for reproducibility
- Advanced chunking: recursive and semantic strategies
- Synthetic data generation with GPT-4
- Argilla for RLHF annotation and DPO data
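As a taste of the DVC material, a pipeline stage is declared in a `dvc.yaml` file: DVC tracks the declared dependencies and outputs, and re-runs the stage only when a dependency changes. The script and directory names below are hypothetical placeholders:

```yaml
# dvc.yaml — one stage of a hypothetical data pipeline
stages:
  preprocess:
    cmd: python preprocess.py data/raw data/clean
    deps:
      - preprocess.py
      - data/raw
    outs:
      - data/clean
```

Running `dvc repro` executes any stage whose `deps` have changed, while `data/clean` is versioned by DVC rather than Git, keeping large outputs out of your repository but still reproducible.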