- Respects natural document boundaries (paragraphs, sections)
- Maintains semantic relationships between segments
- Optimizes chunk size for LLM processing
## Defaults

- `target_length`: 4096
- `tokenizer`: `Cl100kBase`
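These defaults can be sketched as a configuration fragment. The field names below follow the defaults listed above; this is an illustrative sketch, not the exact API schema:

```python
# Illustrative sketch of a chunk_processing configuration using the defaults
# above; the exact request shape may differ from the real Chunkr API.
chunk_processing = {
    "target_length": 4096,      # target tokens per chunk (default)
    "tokenizer": "Cl100kBase",  # default tokenizer enum value
}
```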
## Tokenizer
Chunkr supports a large number of tokenizers. You can use our predefined ones or specify any tokenizer from Hugging Face.

### Predefined Tokenizers
The predefined tokenizers are enum values and can be used as follows:

- `Word`: Split by words
- `Cl100kBase`: For OpenAI models (e.g. GPT-3.5, GPT-4, text-embedding-ada-002)
- `XlmRobertaBase`: For RoBERTa-based multilingual models
- `BertBaseUncased`: BERT base uncased tokenizer
### Hugging Face Tokenizers
Use any Hugging Face tokenizer by providing its model ID as a string (e.g. “facebook/bart-large”, “Qwen/Qwen-tokenizer”, etc.).

## Calculating Chunk Lengths With Embed Fields
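A minimal sketch of how such a value might be interpreted, assuming (as the text implies) that any string other than a predefined enum name is treated as a Hugging Face model ID. `resolve_tokenizer` is a hypothetical helper for illustration, not part of the Chunkr API:

```python
# Predefined tokenizer enum names listed above.
PREDEFINED = {"Word", "Cl100kBase", "XlmRobertaBase", "BertBaseUncased"}

def resolve_tokenizer(value: str) -> str:
    """Classify a tokenizer setting: predefined enum name or HF model ID."""
    return "predefined" if value in PREDEFINED else "huggingface"

resolve_tokenizer("Cl100kBase")           # → "predefined"
resolve_tokenizer("facebook/bart-large")  # → "huggingface"
```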
When calculating chunk lengths and performing tokenization, we use the text from the `embed` field in each chunk and segment object. This field contains the text that is compared against the target length.
The content of the embed field is automatically calculated based on whether a description is generated for segments:
- If `description` is set to `true` for a segment type, the embed field will include both the `content` and the generated `description`
- If `description` is set to `false` or not specified, the embed field will contain only the `content`
- Picture segments: Length will be based on both the content and generated description (since `description: true`)
- Table segments: Length will be based on both the content and generated description (since `description: true` is the default)
- All other segments: Length will be based only on the content (since `description` is not specified)
- The tokenizer will be `CL100K_BASE` (default)
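Put together, the per-type rules can be modeled as a small lookup. The defaults shown are assumptions based on the list above:

```python
# Assumed per-type description settings, per the list above: Picture and
# Table include the generated description; everything else is content-only.
DESCRIPTION_ENABLED = {"Picture": True, "Table": True}

def embed_includes_description(segment_type: str) -> bool:
    """Does this segment type's embed field include the description?"""
    return DESCRIPTION_ENABLED.get(segment_type, False)

embed_includes_description("Table")  # → True
embed_includes_description("Text")   # → False
```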
## How Segments Become Chunks
Understanding how individual segments are combined into chunks is crucial for optimizing your chunking strategy. Here’s how the process works:

### Segment to Chunk Flow
1. Individual segments are created from document layout analysis
2. Each segment gets its own `embed` field (content only, or content + description)
3. Segments are grouped together based on the target chunk length
4. All segment `embed` fields in a group are concatenated to form the chunk’s `embed` field
5. The `chunk_length` represents the tokenized length of this concatenated content
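The flow above can be sketched as a greedy packing pass over per-segment embed lengths. This is a simplification; the real service may split or balance chunks differently:

```python
def group_segments(embed_lengths, target):
    """Greedy sketch: pack consecutive segments into a chunk until
    adding the next segment would exceed the target length."""
    chunks, current, total = [], [], 0
    for length in embed_lengths:
        if current and total + length > target:
            chunks.append(current)
            current, total = [], 0
        current.append(length)
        total += length
    if current:
        chunks.append(current)
    return chunks

group_segments([60, 70, 50, 80], target=130)  # → [[60, 70], [50, 80]]
```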
### Visual Example
The following diagram illustrates how segments are combined into chunks, showing how embed fields are concatenated and lengths are calculated. In this example:

- Segments 1 & 3 have only content in their embed fields (no description enabled)
- Segments 2 & 4 have content + description in their embed fields (description enabled)
- Chunk 1 combines Segments 1 & 2, concatenating their embed fields
- Chunk 2 combines Segments 3 & 4, concatenating their embed fields
- The final chunk lengths (130 tokens each) reflect the total tokenized length of the concatenated embed content
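The length calculation in the example can be mirrored with a toy tokenizer; here a whitespace split stands in for `Cl100kBase`, so the real token counts would differ:

```python
def chunk_length(segment_embeds):
    """Concatenate segment embed fields and count tokens.
    A whitespace split stands in for the Cl100kBase tokenizer."""
    concatenated = "\n".join(segment_embeds)
    return len(concatenated.split())

chunk_length(["first segment content",
              "second segment content plus description"])  # → 8
```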