- Respects natural document boundaries (paragraphs, sections)
- Maintains semantic relationships between segments
- Optimizes chunk size for LLM processing
## Defaults

- `target_length`: 4096
- `tokenizer`: `Cl100kBase`
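These defaults can be sketched as a configuration fragment. The field names below follow the defaults listed above; this is an illustrative sketch, not the exact API schema:

```python
# Illustrative sketch of a chunk_processing configuration using the defaults
# above; the exact request shape may differ from the real Chunkr API.
chunk_processing = {
    "target_length": 4096,      # target tokens per chunk (default)
    "tokenizer": "Cl100kBase",  # default tokenizer enum value
}
```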
## Tokenizer
Chunkr supports a large number of tokenizers. You can use our predefined ones or specify any tokenizer from Hugging Face.

### Predefined Tokenizers
The predefined tokenizers are enum values and can be used as follows:

- `Word`: Split by words
- `Cl100kBase`: For OpenAI models (e.g. GPT-3.5, GPT-4, text-embedding-ada-002)
- `XlmRobertaBase`: For RoBERTa-based multilingual models
- `BertBaseUncased`: BERT base uncased tokenizer
### Hugging Face Tokenizers
Use any Hugging Face tokenizer by providing its model ID as a string (e.g. “facebook/bart-large”, “Qwen/Qwen-tokenizer”, etc.).

## Calculating Chunk Lengths With Embed Fields
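A minimal sketch of how such a value might be interpreted, assuming (as the text implies) that any string other than a predefined enum name is treated as a Hugging Face model ID. `resolve_tokenizer` is a hypothetical helper for illustration, not part of the Chunkr API:

```python
# Predefined tokenizer enum names listed above.
PREDEFINED = {"Word", "Cl100kBase", "XlmRobertaBase", "BertBaseUncased"}

def resolve_tokenizer(value: str) -> str:
    """Classify a tokenizer setting: predefined enum name or HF model ID."""
    return "predefined" if value in PREDEFINED else "huggingface"

resolve_tokenizer("Cl100kBase")           # → "predefined"
resolve_tokenizer("facebook/bart-large")  # → "huggingface"
```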
When calculating chunk lengths and performing tokenization, we use the text from the `embed` field in each chunk and segment object. This field contains the text that is compared against the target length.
The content of the embed field is automatically calculated based on whether a description is generated for segments:
- If `description` is set to `true` for a segment type, the embed field will include both the `content` and the generated `description`
- If `description` is set to `false` or not specified, the embed field will contain only the `content`
- Picture segments: Length will be based on both the content and generated description (since `description: true`)
- Table segments: Length will be based on both the content and generated description (since `description: true` is the default)
- All other segments: Length will be based only on the content (since `description` is not specified)
- The tokenizer will be `CL100K_BASE` (default)
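Put together, the per-type rules can be modeled as a small lookup. The defaults shown are assumptions based on the list above:

```python
# Assumed per-type description settings, per the list above: Picture and
# Table include the generated description; everything else is content-only.
DESCRIPTION_ENABLED = {"Picture": True, "Table": True}

def embed_includes_description(segment_type: str) -> bool:
    """Does this segment type's embed field include the description?"""
    return DESCRIPTION_ENABLED.get(segment_type, False)

embed_includes_description("Table")  # → True
embed_includes_description("Text")   # → False
```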
## How Segments Become Chunks
Understanding how individual segments are combined into chunks is crucial for optimizing your chunking strategy. Here’s how the process works:

### Segment to Chunk Flow
1. Individual segments are created from document layout analysis
2. Each segment gets its own `embed` field (content only, or content + description)
3. Segments are grouped together based on the target chunk length
4. All segment `embed` fields in a group are concatenated to form the chunk’s `embed` field
5. The `chunk_length` represents the tokenized length of this concatenated content
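The flow above can be sketched as a greedy packing pass over per-segment embed lengths. This is a simplification; the real service may split or balance chunks differently:

```python
def group_segments(embed_lengths, target):
    """Greedy sketch: pack consecutive segments into a chunk until
    adding the next segment would exceed the target length."""
    chunks, current, total = [], [], 0
    for length in embed_lengths:
        if current and total + length > target:
            chunks.append(current)
            current, total = [], 0
        current.append(length)
        total += length
    if current:
        chunks.append(current)
    return chunks

group_segments([60, 70, 50, 80], target=130)  # → [[60, 70], [50, 80]]
```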
### Visual Example
The following diagram illustrates how segments are combined into chunks, showing how embed fields are concatenated and lengths are calculated. In this example:

- Segments 1 & 3 have only content in their embed fields (no description enabled)
- Segments 2 & 4 have content + description in their embed fields (description enabled)
- Chunk 1 combines Segments 1 & 2, concatenating their embed fields
- Chunk 2 combines Segments 3 & 4, concatenating their embed fields
- The final chunk lengths (130 tokens each) reflect the total tokenized length of the concatenated embed content
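The length calculation in the example can be mirrored with a toy tokenizer; here a whitespace split stands in for `Cl100kBase`, so the real token counts would differ:

```python
def chunk_length(segment_embeds):
    """Concatenate segment embed fields and count tokens.
    A whitespace split stands in for the Cl100kBase tokenizer."""
    concatenated = "\n".join(segment_embeds)
    return len(concatenated.split())

chunk_length(["first segment content",
              "second segment content plus description"])  # → 8
```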