Chunking
Chunking is the process of splitting a document into smaller segments. These chunks can be used for semantic search and to improve LLM performance.
By leveraging layout analysis, we create intelligent chunks that preserve document structure and context. Our algorithm:
- Respects natural document boundaries (paragraphs, sections)
- Maintains semantic relationships between segments
- Optimizes chunk size for LLM processing
You can review the implementation of our chunking algorithm in our GitHub repository.
Here is an example that chunks the document into 512 words per chunk. These values are also the defaults, so you don’t need to specify them.
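As a rough illustration of what word-based chunking to a target length does, here is a minimal sketch. This is not Chunkr's actual implementation (which also respects layout boundaries such as paragraphs and sections); it only shows the target-length idea.

```python
# Minimal sketch: split text into chunks of at most `target_length` words.
# Illustrative only — Chunkr's real algorithm is layout-aware.
def chunk_by_words(text: str, target_length: int = 512) -> list[str]:
    words = text.split()
    return [
        " ".join(words[i:i + target_length])
        for i in range(0, len(words), target_length)
    ]

doc = ("word " * 1200).strip()  # a 1200-word toy document
chunks = chunk_by_words(doc)
print([len(c.split()) for c in chunks])  # → [512, 512, 176]
```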
Defaults
- ignore_headers_and_footers: True
- target_length: 512
- tokenizer: Word

Tokenizer
Chunkr supports a large number of tokenizers. You can use our predefined ones or specify any tokenizer from Hugging Face.
Predefined Tokenizers
The predefined tokenizers are enum values and can be used as follows:
Available options:
- Word: Split by words
- Cl100kBase: For OpenAI models (e.g. GPT-3.5, GPT-4, text-embedding-ada-002)
- XlmRobertaBase: For RoBERTa-based multilingual models
- BertBaseUncased: BERT base uncased tokenizer
You can also define the tokenizer enum as a string in the Python SDK. Here is an example where the string will be converted to the enum value.
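To illustrate the idea of string-to-enum conversion, here is a minimal stand-in. The Tokenizer enum below is hypothetical (its member and value names are taken from the option list above); the real SDK defines its own.

```python
from enum import Enum

# Hypothetical stand-in for the SDK's tokenizer enum; names follow the
# predefined options listed above.
class Tokenizer(Enum):
    WORD = "Word"
    CL100K_BASE = "Cl100kBase"
    XLM_ROBERTA_BASE = "XlmRobertaBase"
    BERT_BASE_UNCASED = "BertBaseUncased"

def coerce_tokenizer(value) -> Tokenizer:
    """Accept either a Tokenizer member or its string value (e.g. "Word")."""
    if isinstance(value, Tokenizer):
        return value
    return Tokenizer(value)  # raises ValueError for unknown strings

tok = coerce_tokenizer("Word")
assert tok is Tokenizer.WORD
```

This is the general pattern: a string like "Word" maps onto the corresponding enum value, so both forms configure the same tokenizer.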
Hugging Face Tokenizers
Use any Hugging Face tokenizer by providing its model ID as a string (e.g. “facebook/bart-large”, “Qwen/Qwen-tokenizer”, etc.)
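As a sketch, a chunk-processing configuration that selects a Hugging Face tokenizer by model ID might look like the fragment below. The field names follow the parameters described on this page, but the exact JSON shape is an assumption — check the API reference.

```json
{
  "chunk_processing": {
    "target_length": 512,
    "tokenizer": "Qwen/Qwen-tokenizer"
  }
}
```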
Calculating Chunk Lengths With Embed Sources
When calculating chunk lengths and performing tokenization, we use the text from the embed field in each chunk object. This field contains the text that will be compared against the target length.

You can configure what text goes into the embed field by setting the embed_sources parameter in your segment processing configuration. This parameter is specified under segment_processing.{segment_type} in your configuration.
You can see more information about the embed_sources parameter in the Segment Processing section.
Here’s an example of customizing the embed field content for Picture segments. By configuring embed_sources, you can include both the LLM-generated output and Chunkr’s markdown output in the embed field for Pictures, while other segment types will continue using just the default Markdown content. Additionally, we can use the CL100K_BASE tokenizer to configure this for OpenAI models.
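A sketch of what such a configuration might look like as a request payload. The section names follow this page, but the embed source values ("LLM", "Markdown") and the exact JSON shape are assumptions — consult the API reference for the precise schema.

```json
{
  "chunk_processing": {
    "tokenizer": "Cl100kBase"
  },
  "segment_processing": {
    "Picture": {
      "embed_sources": ["LLM", "Markdown"]
    }
  }
}
```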
This means for this configuration, when calculating chunk lengths:
- Picture segments: Length will be based on both the LLM summary and Markdown content
- All other segments: Length will be based only on the Markdown content
- The tokenizer will be CL100K_BASE
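The length calculation described above can be sketched as follows. A simple word count stands in for the CL100K_BASE tokenizer, and the segment dicts and field names are hypothetical; this only illustrates how embed_sources changes which text counts toward a chunk's length.

```python
# Illustrative sketch: how embed_sources affects the measured chunk length.
# Word count stands in for a real tokenizer (e.g. CL100K_BASE).
def embed_length(segment: dict) -> int:
    if segment["type"] == "Picture":
        # Picture segments: LLM output and Markdown both count.
        embed_text = segment.get("llm", "") + " " + segment.get("markdown", "")
    else:
        # All other segments: Markdown only.
        embed_text = segment.get("markdown", "")
    return len(embed_text.split())

picture = {"type": "Picture", "llm": "A bar chart of revenue",
           "markdown": "![chart](img.png)"}
text = {"type": "Text", "markdown": "Plain paragraph here."}
print(embed_length(picture), embed_length(text))  # → 6 3
```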
By combining the embed_sources parameter with the tokenizer parameter, you can customize the chunk lengths and tokenization for different segment types. This allows you to have very powerful chunking configurations for your documents.