Chunking is an important step in LLM pipelines: it breaks the knowledge base into pieces small enough to fit inside the context window. Chunking is as much an art as a science; the best strategy depends on your use case, and how well you do it directly influences the performance of your [[RAG]] system.
A good chunking strategy:
- Is semantically coherent (not just length-based)
- Preserves structure, especially for tables/lists
- Respects Markdown boundaries like `# headings`, tables, and bullet points
- Is token-efficient (ideally < 1500 tokens per chunk for embedding)
Three broad options exist for chunking:
1. **Splitting**: split the document at a pre-defined character length, usually with overlap (a sliding-window approach) to preserve context across boundaries; chunk lengths are sometimes randomized. See the sketch after this list.
2. **Recursive** (hierarchical, context-aware): recursively chunk into smaller and smaller pieces based on text or document structure (e.g., newlines, markdown headings) until all chunks meet max size requirements.
3. **Semantic**: split based on semantic groups, or wherever there is a significant change in meaning. Use embeddings to compare the meaning of each new sentence (or similar text unit) with what came before, and start a new chunk when the embeddings diverge significantly (see Greg Kamradt's [example](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) and the sketch after this list). Sometimes powered by ML models.
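A minimal sketch of options 1 and 3 (the chunk sizes, embedding model, and similarity threshold are illustrative choices, not recommendations):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def sliding_window_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Option 1: fixed-length chunks with an overlapping window (sizes in characters)."""
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

def semantic_chunks(sentences: list[str], threshold: float = 0.6) -> list[str]:
    """Option 3: start a new chunk wherever consecutive sentence embeddings diverge."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        cosine = float(np.dot(emb[i - 1], emb[i]))  # embeddings are unit-normalized
        if cosine < threshold:  # significant change in meaning -> cut here
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

Comparing only adjacent sentences is the simplest semantic variant; comparing windows of sentences smooths out noisy single-sentence embeddings.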
## document chunking
With complex documents like PDFs, which may contain text in complex layouts (e.g., sidebars, text boxes, two-column formats) along with tables, charts, and images, a more sophisticated chunking strategy is needed. The first step is to [[parse PDF]] into a representation that can be chunked; then use the element metadata provided by the parser to distinguish the types of elements and handle each appropriately.
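For example, with the [Unstructured](https://unstructured.io/) library (a sketch; the file name is hypothetical, and `chunk_by_title` groups elements under their section headings):

```python
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

# parse the PDF into typed elements (Title, NarrativeText, Table, ...)
elements = partition_pdf(filename="report.pdf")  # hypothetical file

# route tables to dedicated handling; chunk the remaining prose by heading
tables = [el for el in elements if el.category == "Table"]
prose_chunks = chunk_by_title(
    [el for el in elements if el.category != "Table"],
    max_characters=1500,
)
```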
A common practice for tables is to summarize each table after extraction and embed the summary. If the summary embedding matches the query, pass the raw table (not the summary) to the LLM.
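A sketch of that pattern with LangChain's `MultiVectorRetriever` (import paths assume the `langchain-openai` and `langchain-community` packages; the table and summary strings are toy data, and in practice the summaries would come from an LLM call):

```python
import uuid
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

tables = ["| region | revenue |\n|---|---|\n| EMEA | 4.2M |"]   # raw extracted tables
summaries = ["Quarterly revenue broken down by sales region."]  # LLM-written summaries

id_key = "doc_id"
ids = [str(uuid.uuid4()) for _ in tables]
summary_docs = [Document(page_content=s, metadata={id_key: ids[i]})
                for i, s in enumerate(summaries)]

vectorstore = FAISS.from_documents(summary_docs, OpenAIEmbeddings())  # embed summaries only
docstore = InMemoryStore()
docstore.mset(list(zip(ids, [Document(page_content=t) for t in tables])))  # keep raw tables

retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=docstore, id_key=id_key)
docs = retriever.invoke("revenue by region")  # matches the summary, returns the raw table
```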
Documents with code blocks require similar care, as you want to keep each code block intact. Use a [[LangChain]] `RecursiveCharacterTextSplitter` with language-appropriate [separators](https://github.com/langchain-ai/langchain/blob/9ef2feb6747f5a69d186bd623b569ad722829a5e/libs/langchain/langchain/text_splitter.py#L1069).
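For example (a sketch using the `langchain-text-splitters` package; `from_language` selects the separator list for the given language):

```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN,  # or Language.PYTHON etc. for source files
    chunk_size=1500,
    chunk_overlap=0,
)
chunks = splitter.split_text(open("notes.md").read())  # hypothetical file
```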
## metadata in chunking
You can attach metadata to each chunk (e.g., page number, source document, parent chunk for hierarchical strategies). Each chunk becomes a JSON blob in which the text is one property alongside the metadata. This makes it easy to cite the sources of retrieved chunks later.
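A chunk record might look like this (field names are illustrative):

```python
chunk_record = {
    "text": "Revenue grew 12% year over year ...",
    "metadata": {
        "source": "annual-report.pdf",  # used when citing retrieved chunks
        "page": 12,
        "parent_id": "chunk-041",       # only for hierarchical strategies
    },
}
```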
## advanced chunking
Advanced chunking options incorporate more than just the text itself (multi-vector indexing) or represent chunks as a graph (graph-based chunking).
Multi-vector indexing can include, for example, a summary of the chunk, hypothetical questions the chunk might answer (to match against queries at inference time), and parent chunks (the Parent Document Retriever: match the query against small chunks first, then pull in their parent chunks for context).
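LangChain packages the parent-chunk idea as `ParentDocumentRetriever`; a sketch (chunk sizes are illustrative, and `docs` is assumed to be a list of LangChain `Document` objects):

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

retriever = ParentDocumentRetriever(
    vectorstore=Chroma(collection_name="children", embedding_function=OpenAIEmbeddings()),
    docstore=InMemoryStore(),                                         # holds parent chunks
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=200),    # small chunks to match on
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=1500),  # large chunks for context
)
retriever.add_documents(docs)        # splits, indexes children, stores parents
results = retriever.invoke("query")  # matches children, returns their parents
```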
Graph-based chunking represents the chunks in a graph structure. Neo4j has an implementation of this in their [LLM Knowledge Graph Builder](https://llm-graph-builder.neo4jlabs.com/) product (see [explainer](https://neo4j.com/blog/genai/graphrag-manifesto/)). LangChain integrates [Diffbot](https://www.diffbot.com/) ([InstaGraph](https://github.com/yoheinakajima/instagraph) is an alternative). You can also build the graph yourself with libraries like `networkx`.
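A toy `networkx` sketch (Jaccard word overlap stands in for embedding similarity here, and the 0.2 threshold is arbitrary):

```python
import networkx as nx

def jaccard(a: str, b: str) -> float:
    """Toy lexical similarity; in practice use cosine similarity of chunk embeddings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

chunks = [
    "graph based chunking links related chunks",
    "related chunks become edges in the graph",
    "bananas are yellow",
]
G = nx.Graph()
G.add_nodes_from((i, {"text": c}) for i, c in enumerate(chunks))
for i in range(len(chunks)):
    for j in range(i + 1, len(chunks)):
        if jaccard(chunks[i], chunks[j]) > 0.2:
            G.add_edge(i, j)

# at query time: retrieve a seed chunk, then expand to its neighbors for extra context
neighbors = list(G.neighbors(0))  # -> [1]
```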
## implementations
> [!Tip]- Additional Resources
> - [StackOverflow: Breaking up is hard to do: Chunking in RAG applications](https://stackoverflow.blog/2024/06/06/breaking-up-is-hard-to-do-chunking-in-rag-applications)
> - [Unstructured: Chunking for RAG: best practices](https://unstructured.io/blog/chunking-for-rag-best-practices)
> - [LangChain: Chunking Strategies](https://js.langchain.com/docs/concepts/text_splitters/)
> - [LlamaIndex: Chunking Strategies](https://docs.llamaindex.ai/en/stable/optimizing/basic_strategies/basic_strategies/)
> - [MachineLearning Code: Optimizing RAG with Document Chunking Techniques Using Python](https://github.com/xbeat/Machine-Learning/blob/main/Optimizing%20RAG%20with%20Document%20Chunking%20Techniques%20Using%20Python.md)
> - [5 Levels of Text Splitting (Notebook)](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)
> - [Mistral AI Basic RAG tutorial](https://docs.mistral.ai/guides/rag/)