Chunkr processes files by converting them into chunks, where each chunk contains a list of segments. This basic unit allows our API to be very flexible. See more information in the Layout Analysis section.

After the segments are identified, you can configure a range of post-processing capabilities. You can use our defaults or customize how each segment type is processed.

Processing Methods

  • Vision Language Models (VLM): Leverage AI models to generate HTML/Markdown content and run custom prompts
  • Heuristic-based Processing: Apply rule-based algorithms for consistent HTML/Markdown generation

Additional Features

  • Cropping: Get back the cropped images
  • Content to embed: Configure the content that will be used for chunking and embeddings

Our default processing works well for most documents and RAG use cases.

Note: Chunkr currently does not support creating embeddings; the embed_sources field only controls what populates the embed field for each chunk.

Understanding the configuration

The SegmentProcessing settings define how each segment type is processed: whenever a segment of a given type is identified, the configuration for that type is applied.

These are all the fields that are available for configuration:

GenerationConfig(
    html=GenerationStrategy.AUTO,
    markdown=GenerationStrategy.AUTO,
    crop_image=CroppingStrategy.AUTO,
    llm=None,
    embed_sources=[EmbedSource.MARKDOWN],
    extended_context=False,
)

Defaults

By default, Chunkr applies the following processing strategies for each segment type. You can override these defaults by specifying custom configuration in your SegmentProcessing settings. HTML, Markdown, and content are always returned. Extended context is off by default for all outputs (HTML, Markdown, LLM), and there are no html_page_extended or md_page_extended fields.

# Table and Formula segments are processed using an LLM by default.
# Formulas are returned as LaTeX.

default_llm_config = GenerationConfig(
    html=GenerationStrategy.LLM,
    markdown=GenerationStrategy.LLM,
    crop_image=CroppingStrategy.AUTO,
    llm=None,
    embed_sources=[EmbedSource.MARKDOWN],
    extended_context=False
)

default_config = Configuration(
    segment_processing=SegmentProcessing(
        Table=default_llm_config,
        Formula=default_llm_config,
    )
)

GenerationStrategy

The GenerationStrategy enum determines how Chunkr processes and generates output for a segment. It has two options:

  • GenerationStrategy.LLM: Uses a Vision Language Model (VLM) to analyze and generate descriptions of the segment content. This is particularly useful for complex segments like tables, charts, and images where you want AI-powered understanding.

  • GenerationStrategy.AUTO: Uses rule-based heuristics to process the segment. This is faster and works well for straightforward content like plain text, headers, and lists.

You can configure this strategy separately for HTML and Markdown output formats using the html and markdown fields in the configuration.
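For example, you could generate a table's HTML with an LLM while leaving its Markdown to the faster heuristics. A minimal sketch using the same configuration classes shown above:

```python
from chunkr_ai.models import Configuration, GenerationConfig, GenerationStrategy, SegmentProcessing

# Use an LLM for a Table's HTML output, but rule-based heuristics
# for its Markdown output
config = Configuration(
    segment_processing=SegmentProcessing(
        Table=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.AUTO,
        ),
    )
)
```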

This is how you can access the html and markdown fields on the segment object:

for chunk in task.output.chunks:
    for segment in chunk.segments:
        print(segment.html)
        print(segment.markdown)

CroppingStrategy

The CroppingStrategy enum controls how Chunkr handles image cropping for segments. It offers two options:

  • CroppingStrategy.ALL: Forces cropping for every segment, extracting just the content within its bounding box.

  • CroppingStrategy.AUTO: Lets Chunkr decide when cropping is necessary based on the segment type and post-processing requirements. For example, if an LLM is required to generate HTML from a table, the table will be cropped.
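You can mix strategies per segment type. A sketch, assuming the same configuration classes as elsewhere in this guide:

```python
from chunkr_ai.models import Configuration, CroppingStrategy, GenerationConfig, SegmentProcessing

# Force cropping for every Picture segment; let Chunkr decide for Tables
config = Configuration(
    segment_processing=SegmentProcessing(
        Picture=GenerationConfig(crop_image=CroppingStrategy.ALL),
        Table=GenerationConfig(crop_image=CroppingStrategy.AUTO),
    )
)
```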

This is how you can access the image field on the segment object:

for chunk in task.output.chunks:
    for segment in chunk.segments:
        print(segment.image)

Note: By default the image field contains a presigned URL to the cropped image that is valid for 10 minutes. You can also retrieve the image data as a base64 encoded string by following our best practices guide.

LLM Prompt

The llm field is used to pass a prompt to the LLM. This prompt is independent of the GenerationStrategy and will be applied to all segment types that have the llm field set.
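For instance, a custom prompt for Picture segments might look like the sketch below; the prompt text itself is illustrative:

```python
from chunkr_ai.models import Configuration, GenerationConfig, SegmentProcessing

# The prompt string is illustrative -- any instruction works
config = Configuration(
    segment_processing=SegmentProcessing(
        Picture=GenerationConfig(
            llm="Describe this image and transcribe any text it contains",
        ),
    )
)
```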

If you need extended context for LLM processing, you must explicitly enable it by setting extended_context=True in your GenerationConfig. Extended context requires a page image to be available at runtime.

Extended context is particularly useful for:

  • Tables where legends aren’t properly segmented
  • Images that need surrounding page context for interpretation
  • Charts or diagrams that reference information elsewhere on the page
  • When segments need to “understand” their position in relation to other content

Note: Custom llm prompts can occasionally trigger refusals from the underlying model. If your tasks are failing, try rewording the llm prompt.

Embed Sources

The embed_sources field specifies the sources of content that will be used for embeddings. This is useful if you want to use a different source of content for embeddings than the default HTML or Markdown. The selected sources are also used to calculate chunk length during chunking. See more information in the chunking section.

The embed_sources field is an array; the order of its elements determines the order in which each source appears in the embed field.

For example, if you have [EmbedSource.MARKDOWN, EmbedSource.HTML], the Markdown content will appear first in the embed field. By default, the embed field will only contain the Markdown content.
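That ordering would be configured like this; the Text segment type here is an assumption for illustration:

```python
from chunkr_ai.models import Configuration, EmbedSource, GenerationConfig, SegmentProcessing

# Markdown first, then HTML, in each chunk's embed field
config = Configuration(
    segment_processing=SegmentProcessing(
        Text=GenerationConfig(
            embed_sources=[EmbedSource.MARKDOWN, EmbedSource.HTML],
        ),
    )
)
```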

for chunk in task.output.chunks:
    print(chunk.embed)

Note: This is the only configuration option that affects the chunk object rather than the segment object.

When you set the embed_sources field:

  • You determine what content from segments will be included in the embed field of chunks
  • The order of sources in the array controls which content appears first in the embed field
  • This does not change the order of segments within chunks - reading order is always preserved

For example, if you set embed_sources=[EmbedSource.LLM, EmbedSource.MARKDOWN] for Tables, the LLM-generated content will appear before the markdown content in the embed field of any chunk containing a Table segment.

Extended Context

The extended_context flag controls whether Chunkr provides the full page context when processing a segment. When set to True, Chunkr will use the entire page image as context when processing the segment with an LLM.

Extended context is OFF by default for all segment types and output formats (HTML, Markdown, LLM). To leverage extended context, you must explicitly set extended_context=True within the GenerationConfig for the desired segment type(s).

Extended context is particularly beneficial for:

  • Tables/Charts with External Legends: When a legend or explanatory text is located elsewhere on the page but is crucial for interpreting the table/chart.
  • Images Requiring Surrounding Context: For images where understanding the surrounding text or other visual elements is necessary for accurate description or analysis.
  • Formulas/Diagrams: When the meaning depends on adjacent text or figures.

from chunkr_ai.models import Configuration, GenerationConfig, GenerationStrategy, SegmentProcessing

config = Configuration(
    segment_processing=SegmentProcessing(
        Table=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
            extended_context=True,  # Enable extended context
        ),
        Picture=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
            extended_context=True,  # Enable extended context
        )
        # ... other segment configs
    )
)

Example

Here is a quick example of how to use Chunkr to process a document with different segment processing configurations. This configuration will:

  • Summarize the key trends of all Table segments and populate each segment's llm field with the generated content
  • Enable extended context for Tables and Pictures to capture visual context from the full page
  • Include both the LLM content and the table's Markdown in the embed field of any chunk that contains a Table segment, with the LLM content appearing first
  • Crop all SectionHeader segments to their bounding boxes
  • Use default processing for all other segments

from chunkr_ai import Chunkr
from chunkr_ai.models import (
    Configuration,
    CroppingStrategy,
    EmbedSource,
    GenerationConfig,
    GenerationStrategy,
    SegmentProcessing
)

chunkr = Chunkr()

chunkr.upload("path/to/file", Configuration(
    segment_processing=SegmentProcessing(
        Table=GenerationConfig(
            llm="Summarize the key trends in this table including any context from legends or surrounding text",
            embed_sources=[EmbedSource.LLM, EmbedSource.MARKDOWN],
            extended_context=True
        ),
        Picture=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
            crop_image=CroppingStrategy.ALL,
            extended_context=True,
        ),
        SectionHeader=GenerationConfig(
            crop_image=CroppingStrategy.ALL
        ),
    ),
))