- Respects natural document boundaries (paragraphs, sections)
- Maintains semantic relationships between segments
- Optimizes chunk size for LLM processing
Defaults

- `target_length`: 4096
- `tokenizer`: `Cl100kBase`
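Both defaults can be overridden per task. A minimal sketch, assuming the `chunkr-ai` Python SDK's `Chunkr` client and its `Configuration`/`ChunkProcessing` models (and that the client reads `CHUNKR_API_KEY` from the environment):

```python
from chunkr_ai import Chunkr
from chunkr_ai.models import ChunkProcessing, Configuration

chunkr = Chunkr()  # assumed to read CHUNKR_API_KEY from the environment

# Override the default target_length; tokenizer is left unset,
# so it falls back to the Cl100kBase default.
task = chunkr.upload("document.pdf", Configuration(
    chunk_processing=ChunkProcessing(target_length=1024),
))
```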
Tokenizer
Chunkr supports a large number of tokenizers. You can use our predefined ones or specify any tokenizer from Hugging Face.

Predefined Tokenizers
The predefined tokenizers are enum values and can be used as follows:

- `Word`: Split by words
- `Cl100kBase`: For OpenAI models (e.g. GPT-3.5, GPT-4, text-embedding-ada-002)
- `XlmRobertaBase`: For RoBERTa-based multilingual models
- `BertBaseUncased`: BERT base uncased tokenizer
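In the Python SDK these map to members of a `Tokenizer` enum. A sketch, with the member name (`Tokenizer.BERT_BASE_UNCASED`) assumed from the `CL100K_BASE` naming used later on this page:

```python
from chunkr_ai import Chunkr
from chunkr_ai.models import ChunkProcessing, Configuration, Tokenizer

chunkr = Chunkr()

# Count tokens with BERT's uncased vocabulary instead of the Cl100kBase default.
task = chunkr.upload("document.pdf", Configuration(
    chunk_processing=ChunkProcessing(tokenizer=Tokenizer.BERT_BASE_UNCASED),
))
```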
Hugging Face Tokenizers
Use any Hugging Face tokenizer by providing its model ID as a string (e.g. “facebook/bart-large”, “Qwen/Qwen-tokenizer”, etc.).
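Assuming the same `ChunkProcessing` model also accepts a plain string for Hugging Face model IDs (as described above), only the tokenizer value changes:

```python
from chunkr_ai import Chunkr
from chunkr_ai.models import ChunkProcessing, Configuration

chunkr = Chunkr()

# Any Hugging Face model ID can be passed as a string.
task = chunkr.upload("document.pdf", Configuration(
    chunk_processing=ChunkProcessing(tokenizer="Qwen/Qwen-tokenizer"),
))
```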
Calculating Chunk Lengths With Embed Fields

When calculating chunk lengths and performing tokenization, we use the text from the `embed` field in each chunk and segment object. This field contains the text that will be compared against the target length.
The content of the `embed` field is automatically calculated based on whether a description is generated for segments:
- If `description` is set to `true` for a segment type, the `embed` field will include both the `content` and the generated `description`
- If `description` is set to `false` or not specified, the `embed` field will contain only the `content`
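For example, consider a task configuration along these lines (a sketch in plain request-body form; the exact `segment_processing` field names are assumptions based on the prose above):

```python
config = {
    "chunk_processing": {
        "target_length": 4096,  # default
        # "tokenizer" omitted -> Cl100kBase is used
    },
    "segment_processing": {
        "Picture": {"description": True},  # embed = content + description
        # "Table" omitted -> description defaults to true for tables
        # all other segment types omitted -> embed = content only
    },
}
```

With this configuration: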
- Picture segments: Length will be based on both the content and generated description (since `description: true`)
- Table segments: Length will be based on both the content and generated description (since `description: true` is the default)
- All other segments: Length will be based only on the content (since `description` is not specified)
- The tokenizer will be `CL100K_BASE` (default)
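You can sanity-check the resulting lengths locally with the `tiktoken` package, which implements the same cl100k_base encoding. A sketch with hypothetical segment text (the newline separator between content and description is an assumption for illustration):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

content = "Figure 3 shows quarterly revenue by region."
description = "A bar chart comparing revenue across four regions for 2024."

# Segment type without description generation: embed = content only.
print(len(enc.encode(content)))

# Picture/Table segment with description: true -> embed = content + description.
print(len(enc.encode(content + "\n" + description)))
```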
How Segments Become Chunks
Understanding how individual segments are combined into chunks is crucial for optimizing your chunking strategy. Here’s how the process works:

Segment to Chunk Flow
- Individual segments are created from document layout analysis
- Each segment gets its own `embed` field (content only, or content + description)
- Segments are grouped together based on the target chunk length
- All segment `embed` fields in a group are concatenated to form the chunk’s `embed` field
- The `chunk_length` represents the tokenized length of this concatenated content
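Chunkr performs this grouping server-side, but a simplified greedy model makes the flow concrete: fill a chunk until adding the next segment’s embed field would push it past the target length, then start a new chunk. This is an illustrative sketch, not Chunkr’s exact packing logic (the newline separator between embed fields is also an assumption):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def token_len(text: str) -> int:
    """Tokenized length under cl100k_base, mirroring the default tokenizer."""
    return len(enc.encode(text))

def group_segments(segment_embeds: list[str], target_length: int) -> list[dict]:
    """Greedily pack segment embed fields into chunks of ~target_length tokens."""
    groups, current = [], []
    for embed in segment_embeds:
        # Start a new chunk if adding this segment would exceed the target.
        if current and token_len("\n".join(current + [embed])) > target_length:
            groups.append(current)
            current = [embed]
        else:
            current = current + [embed]
    if current:
        groups.append(current)
    # Each chunk's embed is the concatenation of its segments' embed fields;
    # chunk_length is the tokenized length of that concatenation.
    return [
        {"embed": "\n".join(g), "chunk_length": token_len("\n".join(g))}
        for g in groups
    ]

chunks = group_segments(["First section text...", "Figure 1.\nA chart description."], 512)
for c in chunks:
    print(c["chunk_length"])
```

Note that in this sketch a single segment longer than the target still becomes its own chunk; Chunkr’s real boundary handling may differ.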
Visual Example
The following diagram illustrates how segments are combined into chunks, showing how embed fields are concatenated and lengths are calculated. In this example:

- Segments 1 & 3 have only content in their embed fields (no description enabled)
- Segments 2 & 4 have content + description in their embed fields (description enabled)
- Chunk 1 combines Segments 1 & 2, concatenating their embed fields
- Chunk 2 combines Segments 3 & 4, concatenating their embed fields
- The final chunk lengths (130 tokens each) reflect the total tokenized length of the concatenated embed content
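To make the arithmetic tangible, here is a standalone re-creation of the same shape: four hypothetical segment embed fields packed two per chunk. The texts, and therefore the printed counts, are illustrative rather than the diagram’s actual 130-token values:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Hypothetical embed fields: segments 2 & 4 include a generated description.
seg1 = "Overview of the 2024 annual report."                                   # content only
seg2 = "Figure 1.\nA line chart of monthly active users rising through 2024."  # content + description
seg3 = "Methodology: data was sampled weekly from production logs."            # content only
seg4 = "Table 2.\nRevenue by region, broken down by quarter."                  # content + description

for name, parts in [("Chunk 1", [seg1, seg2]), ("Chunk 2", [seg3, seg4])]:
    embed = "\n".join(parts)  # concatenated segment embed fields
    print(name, "chunk_length =", len(enc.encode(embed)))
```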