Chunking
Chunking is the process of splitting a document into smaller segments. These chunks can be used for semantic search and to improve LLM performance.
By leveraging layout analysis, we create intelligent chunks that preserve document structure and context. Our algorithm:
- Respects natural document boundaries (paragraphs, sections)
- Maintains semantic relationships between segments
- Optimizes chunk size for LLM processing
You can review the implementation of our chunking algorithm in our GitHub repository.
Here is an example that chunks the document into 512 words per chunk. These values are also the defaults, so you don’t need to specify them.
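As a rough illustration of what word-based chunking to a target length does, here is a minimal sketch. This is not Chunkr's actual implementation (which also respects layout boundaries such as paragraphs and sections); it only shows the target-length idea.

```python
# Minimal sketch: split text into chunks of at most `target_length` words.
# Illustrative only — Chunkr's real algorithm is layout-aware.
def chunk_by_words(text: str, target_length: int = 512) -> list[str]:
    words = text.split()
    return [
        " ".join(words[i:i + target_length])
        for i in range(0, len(words), target_length)
    ]

doc = ("word " * 1200).strip()  # a 1200-word toy document
chunks = chunk_by_words(doc)
print([len(c.split()) for c in chunks])  # → [512, 512, 176]
```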
Defaults
- ignore_headers_and_footers: True
- target_length: 512
- tokenizer: Word

Tokenizer
Chunkr supports a large number of tokenizers. You can use our predefined ones or specify any tokenizer from Hugging Face.
Predefined Tokenizers
The predefined tokenizers are enum values and can be used as follows:
Available options:
- Word: Split by words
- Cl100kBase: For OpenAI models (e.g. GPT-3.5, GPT-4, text-embedding-ada-002)
- XlmRobertaBase: For RoBERTa-based multilingual models
- BertBaseUncased: BERT base uncased tokenizer
You can also define the tokenizer enum as a string in the Python SDK. Here is an example where the string will be converted to the enum value.
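To illustrate the idea of string-to-enum conversion, here is a minimal stand-in. The Tokenizer enum below is hypothetical (its member and value names are taken from the option list above); the real SDK defines its own.

```python
from enum import Enum

# Hypothetical stand-in for the SDK's tokenizer enum; names follow the
# predefined options listed above.
class Tokenizer(Enum):
    WORD = "Word"
    CL100K_BASE = "Cl100kBase"
    XLM_ROBERTA_BASE = "XlmRobertaBase"
    BERT_BASE_UNCASED = "BertBaseUncased"

def coerce_tokenizer(value) -> Tokenizer:
    """Accept either a Tokenizer member or its string value (e.g. "Word")."""
    if isinstance(value, Tokenizer):
        return value
    return Tokenizer(value)  # raises ValueError for unknown strings

tok = coerce_tokenizer("Word")
assert tok is Tokenizer.WORD
```

This is the general pattern: a string like "Word" maps onto the corresponding enum value, so both forms configure the same tokenizer.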
Hugging Face Tokenizers
Use any Hugging Face tokenizer by providing its model ID as a string (e.g. “facebook/bart-large”, “Qwen/Qwen-tokenizer”, etc.)
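As a sketch, a chunk-processing configuration that selects a Hugging Face tokenizer by model ID might look like the fragment below. The field names follow the parameters described on this page, but the exact JSON shape is an assumption — check the API reference.

```json
{
  "chunk_processing": {
    "target_length": 512,
    "tokenizer": "Qwen/Qwen-tokenizer"
  }
}
```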
Calculating Chunk Lengths With Embed Sources
When calculating chunk lengths and performing tokenization, we use the text from the embed field in each chunk object. This field contains the text that will be compared against the target length.

You can configure what text goes into the embed field by setting the embed_sources parameter in your segment processing configuration. This parameter is specified under segment_processing.{segment_type} in your configuration.
You can see more information about the embed_sources parameter in the Segment Processing section.
Here’s an example of customizing the embed field content for Picture segments. By configuring embed_sources, you can include both the LLM-generated output and Chunkr’s markdown output in the embed field for Pictures, while other segment types will continue using just the default Markdown content. Additionally, we can use the CL100K_BASE tokenizer to configure this for OpenAI models.
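A sketch of what such a configuration might look like as a request payload. The section names follow this page, but the embed source values ("LLM", "Markdown") and the exact JSON shape are assumptions — consult the API reference for the precise schema.

```json
{
  "chunk_processing": {
    "tokenizer": "Cl100kBase"
  },
  "segment_processing": {
    "Picture": {
      "embed_sources": ["LLM", "Markdown"]
    }
  }
}
```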
This means for this configuration, when calculating chunk lengths:
- Picture segments: Length will be based on both the LLM summary and Markdown content
- All other segments: Length will be based only on the Markdown content
- The tokenizer will be CL100K_BASE
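The length calculation described above can be sketched as follows. A simple word count stands in for the CL100K_BASE tokenizer, and the segment dicts and field names are hypothetical; this only illustrates how embed_sources changes which text counts toward a chunk's length.

```python
# Illustrative sketch: how embed_sources affects the measured chunk length.
# Word count stands in for a real tokenizer (e.g. CL100K_BASE).
def embed_length(segment: dict) -> int:
    if segment["type"] == "Picture":
        # Picture segments: LLM output and Markdown both count.
        embed_text = segment.get("llm", "") + " " + segment.get("markdown", "")
    else:
        # All other segments: Markdown only.
        embed_text = segment.get("markdown", "")
    return len(embed_text.split())

picture = {"type": "Picture", "llm": "A bar chart of revenue",
           "markdown": "![chart](img.png)"}
text = {"type": "Text", "markdown": "Plain paragraph here."}
print(embed_length(picture), embed_length(text))  # → 6 3
```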
By combining the embed_sources parameter with the tokenizer parameter, you can customize the chunk lengths and tokenization for different segment types. This allows you to have very powerful chunking configurations for your documents.