We have an entire series on AI-ready data. Read the previous post for a quick breakdown of the RAG pipeline diagram and learn more about Unstructured AI, an ETL layer designed to help you process unstructured documents.
In Retrieval Augmented Generation (RAG), document chunking is essential for optimizing information retrieval and generation.
Dividing documents into manageable segments allows systems to efficiently access relevant data, improve contextual understanding, and increase response accuracy.
Below, we’ll show you how to chunk documents for RAG and how it can improve your large language model’s capabilities.
Key Takeaways
Chunking divides documents into smaller, meaningful segments, improving retrieval and response accuracy in RAG systems.
By preserving context and reducing information overload, chunking enhances the RAG system’s ability to provide relevant, precise outputs.
Common chunking strategies include fixed-size, sentence-based, paragraph-based, and semantic-based chunks, ensuring optimal performance for various tasks.
Each chunk can be enriched with metadata and converted into numerical representations (embeddings) for improved semantic understanding.
Overlapping segments maintain context between chunks, boosting the likelihood of retrieving relevant information and improving system performance.
Unstructured AI efficiently handles various file types, providing intelligent chunking, table extraction, and hierarchical text retention for seamless integration into RAG systems.
What Is Chunking for RAG?
Chunking for RAG is a process of breaking down large documents into smaller segments called chunks.
Each chunk maintains contextual integrity, ensuring the retrieved information is meaningful and directly applicable to the generation task. Effective chunking therefore tightens the integration between the retrieval and generation stages.
Chunking also improves RAG performance and leads to more accurate and contextually relevant outputs.
Why Is Chunking Documents for RAG Important?
Chunking documents allows the RAG system to locate and access relevant data in response to users’ queries.
Document chunking is also an important segmentation process that ensures each chunk maintains its contextual integrity, which leads to more accurate and meaningful generated responses.
Chunking also reduces processing time and memory usage, ensuring faster and more scalable operations.
How to Chunk Documents for RAG
To chunk documents for RAG, we rely on an eight-step process that includes:
Document ingestion
Text extraction
Text cleaning
Chunking
Overlap
Metadata
Embedding
Indexing
Document Ingestion
Goal: To establish a solid foundation for your RAG system and facilitate efficient and accurate information retrieval.
The first step in setting up a RAG system involves gathering all documents you intend to use. You can begin by collecting a wide range of relevant texts such as articles, books, and reports.
Once you’ve assembled the documents, organize them into a designated folder or database to ensure the RAG system can easily access them.
To ensure compatibility, check and note the various file formats you have, such as:
PDFs
Word documents
Plain text files
Web pages
If you have an extensive document collection, prioritize the documents to determine which ones should be processed first based on their importance or relevance.
We highly recommend performing deduplication to identify and remove any duplicate documents to maintain efficiency and avoid redundancy.
Also, use the most up-to-date document versions if there are multiple iterations.
Collect metadata. Provide information about each document, such as title, author, creation date, and source. This ensures better organization and retrieval later on in the process.
Address access rights. Verify which documents are confidential, set appropriate permissions, and ensure the system only uses authorized information.
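To make ingestion and deduplication concrete, here is a minimal Python sketch; the folder layout, supported extensions, and metadata fields are illustrative assumptions rather than requirements.

```python
import hashlib
from pathlib import Path

# Extensions assumed for this sketch; extend to match your own collection.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".html"}

def ingest_documents(folder: str) -> list[dict]:
    """Walk a document folder, skip exact duplicates by content hash,
    and record basic metadata for each file."""
    seen_hashes = set()
    catalog = []
    for path in Path(folder).rglob("*"):
        if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen_hashes:  # deduplication: identical content already seen
            continue
        seen_hashes.add(digest)
        catalog.append({
            "path": str(path),
            "format": path.suffix.lower(),
            "size_bytes": path.stat().st_size,
            "modified": path.stat().st_mtime,  # helps pick the most recent version
            "sha256": digest,
        })
    return catalog
```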
Text Extraction
Goal: Extract raw data and ensure it is well-structured.
The next step of the document chunking process is extracting raw data from your collected documents to ensure the information is ready for processing.
Begin the process by handling different file formats using appropriate tools.
For example, you can use libraries like PyPDF2 or pdfminer for PDFs and python-docx for processing Word documents.
BeautifulSoup or other web scraping utilities are ideal for retrieving necessary information from web pages.
The goal of text extraction is to extract the main content of each document while ignoring elements like headers, footers, page numbers, and tables of contents to maintain relevance and clarity.
However, if some of these elements add contextual value, we recommend including them.
For example, if a document contains images or tables, you can:
1) decide to ignore them,
2) extract any embedded text (like captions and table contents), or
3) retain a placeholder indicating their original position within the document.
We find tools like PDFPlumber ideal for capturing data from complex tables.
Preserving the structure of the original document is a priority. Ensure that paragraph breaks, section titles, and list items are maintained to retain the document’s organization and readability.
Additionally, it’s recommended to standardize the character encoding to UTF-8, as it helps minimize issues with special characters or multiple languages, while ensuring consistency across extracted text.
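As a rough illustration of extracting text from the formats above with the libraries we just mentioned, here is a simplified Python sketch; the dispatch logic and error handling are pared-down assumptions, not production code.

```python
from pathlib import Path

from PyPDF2 import PdfReader      # PDFs
from docx import Document         # Word (.docx) files
from bs4 import BeautifulSoup     # HTML / web pages

def extract_text(path: str) -> str:
    """Dispatch on file extension and return the document's main text."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix == ".docx":
        doc = Document(path)
        # Joining paragraph by paragraph helps preserve the original structure.
        return "\n".join(p.text for p in doc.paragraphs)
    if suffix in {".html", ".htm"}:
        html = Path(path).read_text(encoding="utf-8", errors="replace")
        return BeautifulSoup(html, "html.parser").get_text(separator="\n")
    # Fall back to plain text, standardized to UTF-8.
    return Path(path).read_text(encoding="utf-8", errors="replace")
```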
Text Cleaning
Goal: To prepare data for high-quality input into your RAG system.
The next step is tidying up the extracted raw text to prepare it for chunking.
We’ve broken down this process into four stages:
1. Text Cleaning: First Stage
First, remove extra whitespace, including multiple spaces, tabs, and unnecessary line breaks, to make your data more readable.
We also recommend you deal with any special characters and decide which ones to keep and which to remove. For example, punctuation marks are often necessary, while symbols or emojis may be irrelevant to the task at hand.
2. Text Cleaning: Second Stage
The next part is standardizing the formatting of the text.
This often includes converting all text to lowercase unless case sensitivity is important for the content.
It’s also important to ensure consistent use of quotation marks and apostrophes throughout the document.
When it comes to numbers, decide whether to keep them as-is, convert them to words, or remove them entirely based on your specific needs.
Remove boilerplate text. Delete phrases that don’t add value, such as “All rights reserved” or “Table of contents,” to streamline the content.
We also recommend correcting common errors like spelling mistakes.
If necessary, anonymize the text by removing names, addresses, and other private information.
3. Text Cleaning: Third Stage
At the last stage, we typically normalize the text. This can include anything from expanding contractions to standardizing dates, formats, and measurements.
When working with multilingual documents, we recommend performing language detection to identify the language of each text segment.
4. Text Cleaning: Fourth Stage
Sentence segmentation is also a critical part of the process, where you can clearly mark the beginning and end of sentences to prepare for chunking.
Finally, remove any irrelevant sections of the text that don’t contribute to your objectives.
This could include reference lists, appendices, or other sections that aren’t necessary for analysis.
By the end of this process, your text should be clean, standardized, and ready for chunking.
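Here is a minimal cleaning sketch covering a few of the stages above (whitespace, quote standardization, boilerplate removal, optional lowercasing); the boilerplate list and normalization choices are assumptions you would tune for your own corpus.

```python
import re
import unicodedata

# Illustrative boilerplate phrases; extend this list for your documents.
BOILERPLATE = ["All rights reserved", "Table of contents"]

def clean_text(text: str, lowercase: bool = True) -> str:
    # Normalize Unicode so accents, dashes, and quotes are consistent.
    text = unicodedata.normalize("NFKC", text)
    # Standardize curly quotes and apostrophes.
    text = text.replace("\u201c", '"').replace("\u201d", '"').replace("\u2019", "'")
    # Drop boilerplate phrases that add no value.
    for phrase in BOILERPLATE:
        text = text.replace(phrase, "")
    # Collapse extra whitespace: repeated spaces, tabs, and blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    if lowercase:
        text = text.lower()
    return text.strip()
```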
Chunking
Goal: To enable more efficient processing, improved relevance in retrieval, and better handling of context windows in language models.
Chunking refers to breaking down the clean text into smaller, manageable chunks.
This step is crucial for RAG systems, as it determines how effectively information will be retrieved. Selecting the right chunking strategy is usually key to maximizing effectiveness.
Here are different chunking strategies to choose from:
Fixed-size chunks - splitting the text into chunks of a set number of tokens (words or characters).
Paragraph-based chunks - using natural paragraph breaks as chunk boundaries.
Semantic chunking - keeping related ideas together in one chunk.
There are also advanced chunking techniques like recursive chunking.
After choosing the right chunking method, you need to determine the desired chunk size.
Chunks must be small enough to be specific but large enough to retain context. Common chunk sizes range from 100 to 1,000 tokens, depending on the task at hand.
Other parts of the process include:
Handling edge cases. For example, very short paragraphs or sentences might be combined, while very long paragraphs could be split into smaller chunks.
Preserving context. This may involve including section titles or metadata with each chunk. In some cases, overlapping chunks are used to ensure continuity.
Maintaining coherence. This implies avoiding splitting sentences or phrases in the middle and adjusting chunk boundaries if necessary.
Handling special content, such as lists or tables, which might require unique rules.
Each chunk is assigned a unique identifier, and you should keep track of which document each chunk came from and its position within that document. We recommend experimenting with different methods to see what works best for your use case.
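To make this step concrete, here is a sketch of paragraph-based chunking that merges very short paragraphs, splits very long ones, and tags each chunk with a unique identifier and its position; the word-count thresholds are illustrative assumptions.

```python
import uuid

def paragraph_chunks(text: str, doc_id: str,
                     min_words: int = 50, max_words: int = 300) -> list[dict]:
    """Paragraph-based chunking with simple edge-case handling."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, buffer = [], ""

    def add_chunk(piece: str) -> None:
        chunks.append({
            "chunk_id": str(uuid.uuid4()),
            "doc_id": doc_id,          # which document the chunk came from
            "position": len(chunks),   # its position within that document
            "text": piece,
        })

    for para in paragraphs:
        buffer = f"{buffer}\n\n{para}".strip() if buffer else para
        words = buffer.split()
        if len(words) < min_words:
            continue  # keep merging very short paragraphs
        # Split anything that grew past the maximum into smaller pieces.
        for start in range(0, len(words), max_words):
            add_chunk(" ".join(words[start:start + max_words]))
        buffer = ""
    if buffer:  # flush any remaining text at the end of the document
        add_chunk(buffer)
    return chunks
```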
Overlapping
Goal: To ensure context continuity.
Overlapping, a step closely related to chunking, refers to the practice of having chunks share some content with their neighboring chunks.
Chunk overlap helps:
Maintain context between chunks
Reduce the chance of splitting important information
Improve retrieval by increasing the chances of finding relevant information
To implement overlap, you need to include some content from the end of the previous chunk when creating a new one. Typically, overlap constitutes 10-20% of the chunk size.
There are three different types of overlap:
Token overlap - involves repeating a set number of words or characters.
Sentence overlap - one or more full sentences are shared between chunks.
Sliding window - moves progressively through the text and creates a substantial overlap.
One thing to note, though, is that overlap will increase your storage needs and processing time. Also, excessive overlap may lead to redundant information during retrieval.
For a fixed chunk size, you can reuse the last 50 tokens of one chunk in the next.
For sentence-based chunks, you can include the final sentence of the previous chunk.
For paragraph-based chunks, you can add a summary or topic sentence from the preceding paragraph.
Handling special cases, such as document boundaries or lists and tables, may require unique overlap rules. We recommend adjusting the amount of overlap through experimentation or dynamically based on the content type and importance to find the optimal balance for your use case.
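Below is a minimal sliding-window sketch with token overlap, assuming simple whitespace tokenization as a stand-in for a real tokenizer; the defaults follow the 10-20% guideline above (50 tokens of overlap on 500-token chunks).

```python
def sliding_window_chunks(tokens: list[str], chunk_size: int = 500,
                          overlap: int = 50) -> list[list[str]]:
    """Each chunk repeats the last `overlap` tokens of the previous one."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(window)
        if start + chunk_size >= len(tokens):
            break  # the final window already reaches the end of the text
    return chunks

# Usage sketch with placeholder text standing in for the cleaned document.
tokens = "your cleaned document text goes here".split()
print(sliding_window_chunks(tokens, chunk_size=4, overlap=1))
```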
Metadata
Goal: To enable better tracking and debugging, ultimately improving the system’s performance.
In this step, you can attach additional information or metadata to each chunk. Metadata helps track chunks and provides valuable context during retrieval and generation.
Common types of metadata include:
Source document
Chunk position
A unique chunk ID
Creation time
Section or chapter reference
Author
Topic or category
Adding this data makes organizing and searching chunks more efficient and provides context for the RAG system. It allows for filtering or prioritizing chunks during retrieval and helps track the origin of the information.
To add metadata, you may create a separate database or index or embed it directly within the chunk text. A structured format like JSON is typically used for consistency.
Also, special care is required to ensure metadata is standardized and updated when source documents change, while sensitive information is carefully excluded to maintain privacy.
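Here is an illustrative example of a single chunk record with its metadata stored as JSON; the file names, section labels, and field choices are assumptions you would adapt to your own documents.

```python
import json
from datetime import datetime, timezone

# One chunk with its metadata, serialized as JSON for consistency.
chunk_record = {
    "chunk_id": "doc-042-chunk-007",
    "text": "Example chunk text goes here ...",
    "metadata": {
        "source_document": "annual_report_2023.pdf",
        "position": 7,
        "section": "3. Financial Overview",
        "author": "Finance Team",
        "topic": "financial results",
        "created_at": datetime.now(timezone.utc).isoformat(),
    },
}

print(json.dumps(chunk_record, indent=2))
```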
Embedding
Goal: To enable efficient and semantically meaningful retrieval of relevant information from large datasets and enhance the accuracy and contextual relevance of generated responses.
This step involves converting text chunks into numerical representations known as embeddings.
Embeddings allow machines to process and understand the semantic meaning of the text. They help compare chunks based on their meaning rather than just their wording.
You can create embeddings by feeding each text chunk through a pre-trained language model, such as a specialized embedding model like text-embedding-3-small.
When choosing an embedding model, we recommend considering its accuracy, speed, and resource requirements.
Popular models from OpenAI, Google, or open-source alternatives are common choices, depending on the use case and available resources.
The process involves preparing the text, passing it through the model, and storing the resulting vector along with its metadata. For very long chunks, you may need to split them, as models often have input length limits.
To ensure consistency, we recommend using the same embedding model for all chunks and tracking the version used. We also recommend conducting quality checks to validate that the embeddings accurately represent the text and detect any anomalies.
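As a brief sketch, assuming the OpenAI Python client and the text-embedding-3-small model mentioned above (any embedding provider works similarly), embedding a batch of chunks can look like this:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def embed_chunks(chunks: list[dict],
                 model: str = "text-embedding-3-small") -> list[dict]:
    """Attach an embedding vector to each chunk, keeping its text and metadata."""
    response = client.embeddings.create(
        model=model,
        input=[chunk["text"] for chunk in chunks],
    )
    for chunk, item in zip(chunks, response.data):
        chunk["embedding"] = item.embedding
        chunk["embedding_model"] = model  # track the model version for consistency
    return chunks
```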
Indexing
Goal: To enable quick similarity searches, efficiently organize data, and scale to large datasets.
The final step of document chunking is indexing, which refers to organizing and storing your chunks, along with their metadata and embeddings.
To achieve this, you should choose an index structure, typically using vector databases like Pinecone, Weaviate, or Milvus.
Your index should include:
the embedding vectors,
the original text chunks,
and their associated metadata.
The process involves setting up the indexing system, defining the index schema, and inserting each chunk’s embedding, text, and metadata. Once the data is in, you can build index structures for efficient searching.
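As an illustration, here is a hedged sketch of upserting embedded chunks into Pinecone (one of the vector databases mentioned above); the API key, index name, and field mapping are assumptions based on the current Pinecone Python SDK.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")   # hypothetical key
index = pc.Index("rag-chunks")          # hypothetical index name

def index_chunks(chunks: list[dict]) -> None:
    """Store each chunk's embedding, text, and metadata in the vector index."""
    vectors = [
        {
            "id": chunk["chunk_id"],
            "values": chunk["embedding"],
            # Keep the original text and source info as metadata for retrieval.
            "metadata": {"text": chunk["text"], **chunk.get("metadata", {})},
        }
        for chunk in chunks
    ]
    index.upsert(vectors=vectors)

# Sanity check: query with one chunk's own embedding and confirm it comes back
# as the top match.
# index.query(vector=chunks[0]["embedding"], top_k=3, include_metadata=True)
```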
Some of the key considerations include:
Scalability
Update frequency
Query speed
Regular maintenance is required to keep the index up to date, and we also recommend testing the index by performing sample queries to check retrieval accuracy and speed.
It’s important to think about security measures, like access controls and data backups, to ensure data integrity and safety.
This indexing step is crucial for making the RAG system fast and accurate, enabling it to find the most relevant chunks during user query processing.
How We Chunk Documents With Unstructured AI
Unstructured AI serves as a sophisticated Extract, Transform, Load (ETL) layer designed to process complex and unstructured document formats for RAG architectures or downstream GenAI applications.
It supports a wide range of file types, including:
PowerPoint
Excel
CSV
HTML
DOC
DOCX and more
Unstructured AI converts these document formats into structured outputs for advanced AI applications.
Its key features include:
Intelligent chunking
Table extraction
Chart processing
Hierarchical text retention
After structuring the unstructured documents, our solution outputs documents in:
Structured text (in JSON format)
Tables and charts (in CSV and Excel format)
Rich metadata (including titles, dates, and page counts)
In doing so, Unstructured AI helps us:
Process semi-structured reports;
Convert tables into CSV/Excel formats and extract charts as PNGs with semantic descriptions;
Use the intelligent chunking feature to organize text semantically while retaining hierarchical structure;
Create and store embeddings in a vector database;
Feed downstream RAG or GenAI applications that retrieve relevant information.
Let Us Help You Chunk Your Documents for RAG
Need help taking advantage of your data, chunking your documents for RAG, and augmenting your large language model? Please schedule a free 30-minute call with our experts. We can discuss your needs and show you how Unstructured AI and our other AI solutions work live.