Chunkr converts each file into chunks, and each chunk contains a list of segments. The segment is the basic unit of processing, which keeps the API flexible. See the Layout Analysis section for details. Once segments are identified, you can configure how each segment type is post-processed, or rely on our defaults.

Processing Methods

  • Vision Language Models (VLM): Leverage AI models to generate HTML/Markdown content and run custom prompts
  • Heuristic-based Processing: Apply rule-based algorithms for consistent HTML/Markdown generation

Additional Features

  • Cropping: Get back the cropped images
  • Content to embed: Configure the content that will be used for chunking and embeddings
Our default processing works for most documents and RAG use cases.
Note: Chunkr currently does not support creating embeddings. The embed field contains the concatenated content and descriptions (if enabled) from all segments in the chunk.

Understanding the configuration

When you configure the SegmentProcessing settings, you are configuring how each segment type is processed. This means that anytime a segment type is identified, the configuration will be applied. These are all the fields that are available for configuration:
from chunkr_ai.models import CroppingStrategy, GenerationConfig, GenerationStrategy, SegmentFormat

GenerationConfig(
    format=SegmentFormat.MARKDOWN,
    strategy=GenerationStrategy.AUTO,
    crop_image=CroppingStrategy.AUTO,
    description=False,
    extended_context=False,
)

Defaults

By default, Chunkr applies the following processing strategies for each segment type. You can override these defaults by specifying custom configuration in your SegmentProcessing settings. Generated content and OCR text are always returned. Extended context is off by default for all outputs.
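from chunkr_ai.models import (
    Configuration,
    CroppingStrategy,
    GenerationConfig,
    GenerationStrategy,
    SegmentFormat,
    SegmentProcessing,
)

# By default, tables are generated with an LLM, returned as HTML,
# and include a description.
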
default_table_config = GenerationConfig(
    format=SegmentFormat.HTML,
    strategy=GenerationStrategy.LLM,
    crop_image=CroppingStrategy.AUTO,
    description=True,
    extended_context=False
)

default_config = Configuration(
    segment_processing=SegmentProcessing(
        Table=default_table_config,
    )
)

SegmentFormat

The SegmentFormat enum determines the output format for the generated content. It has two options:
  • SegmentFormat.HTML: Generate HTML content for the segment
  • SegmentFormat.MARKDOWN: Generate Markdown content for the segment

GenerationStrategy

The GenerationStrategy enum determines how Chunkr processes and generates output for a segment. It has three options:
  • GenerationStrategy.LLM: Uses a Vision Language Model (VLM) to generate the segment content with segment-specific prompts. This is particularly useful for complex segments like tables, charts, and images where you want AI-powered understanding.
  • GenerationStrategy.AUTO: Uses rule-based heuristics and traditional OCR to generate the segment content. This is faster and works well for straightforward content like plain text, headers, and lists.
  • GenerationStrategy.IGNORE: Excludes the segment from the final output entirely. This is useful when you want to filter out certain types of content.
Set the format (HTML or Markdown) and strategy (LLM, AUTO, or IGNORE) fields for each segment type in the configuration. The generated content and the OCR text are both available on the segment object:
for chunk in task.output.chunks:
    for segment in chunk.segments:
        print(segment.content)  # Generated content in the chosen format (HTML or Markdown)
        print(segment.text)     # OCR-extracted text
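
As a follow-on to the loop above, here is a minimal sketch of stitching the generated content of every segment back into one document. It relies only on the chunk.segments and segment.content fields shown above. The helper name chunks_to_document, the blank-line separator, and the demo data are assumptions for illustration; the demo objects are stand-ins, not real Chunkr objects.

```python
from types import SimpleNamespace


def chunks_to_document(chunks, separator="\n\n"):
    """Join the generated content of every segment, in order, into one string."""
    parts = []
    for chunk in chunks:
        for segment in chunk.segments:
            if segment.content:  # skip segments with empty generated content
                parts.append(segment.content)
    return separator.join(parts)


# Stand-in data shaped like task.output.chunks (hypothetical, for illustration only).
demo_chunks = [
    SimpleNamespace(segments=[
        SimpleNamespace(content="# Quarterly report", text="Quarterly report"),
        SimpleNamespace(content="Revenue grew in Q2.", text="Revenue grew in Q2."),
    ]),
    SimpleNamespace(segments=[
        SimpleNamespace(content="| Q1 | Q2 |\n|----|----|\n| 10 | 12 |", text="Q1 Q2 10 12"),
    ]),
]

document = chunks_to_document(demo_chunks)
print(document)
```

With a real task, the same helper would be called as chunks_to_document(task.output.chunks).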

CroppingStrategy

The CroppingStrategy enum controls how Chunkr handles image cropping for segments. It offers two options:
  • CroppingStrategy.ALL: Forces cropping for every segment, extracting just the content within its bounding box.
  • CroppingStrategy.AUTO: Lets Chunkr decide when cropping is necessary based on the segment type and post-processing requirements. For example, if an LLM is needed to generate content for a table, the table is cropped.
Access the cropped image through the image field of each segment:
for chunk in task.output.chunks:
    for segment in chunk.segments:
        print(segment.image)
Note: By default the image field contains a presigned URL to the cropped image that is valid for 10 minutes. You can also retrieve the image data as a base64 encoded string by following our best practices guide.
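
Because the presigned URL expires, one option is to download cropped images as soon as the task finishes. The sketch below assumes segment.image holds the URL as a string and is empty when no crop exists. The helper name save_segment_images, the file naming scheme, and the generic .img extension are assumptions (the image format is not stated here). The fetch parameter is injectable so the example runs offline with fake bytes.

```python
import tempfile
import urllib.request
from pathlib import Path
from types import SimpleNamespace


def _http_get(url):
    """Download the raw bytes behind a URL using the standard library."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def save_segment_images(chunks, out_dir, fetch=_http_get):
    """Save every cropped segment image to out_dir and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for chunk_index, chunk in enumerate(chunks):
        for segment_index, segment in enumerate(chunk.segments):
            url = getattr(segment, "image", None)
            if not url:  # no cropped image for this segment
                continue
            path = out_dir / f"chunk{chunk_index}_segment{segment_index}.img"
            path.write_bytes(fetch(url))
            saved.append(path)
    return saved


# Offline demo with stand-in segments and a fake fetch function (hypothetical data).
demo_chunks = [
    SimpleNamespace(segments=[
        SimpleNamespace(image=None),
        SimpleNamespace(image="https://example.com/presigned-url"),
    ]),
]

with tempfile.TemporaryDirectory() as tmp:
    saved = save_segment_images(demo_chunks, tmp, fetch=lambda url: b"fake-bytes")
    saved_names = [path.name for path in saved]
    saved_bytes = [path.read_bytes() for path in saved]
```

With a real task, leave fetch at its default and call the helper within the validity window of the URLs.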

Description Field

The description field is a boolean that controls whether Chunkr generates a descriptive summary for segments using an LLM. When set to True, the segment will include an additional description field containing AI-generated content that describes the segment.
for chunk in task.output.chunks:
    for segment in chunk.segments:
        print(segment.content)      # Generated content in the chosen format (HTML or Markdown)
        print(segment.text)         # OCR-extracted text
        print(segment.description)  # AI-generated description (if description=True)
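
To check which segments actually received a description, a small helper can pair each description with the OCR text of its segment. This is a sketch under assumptions: the helper name collect_descriptions is hypothetical, and the code assumes that a segment without a description either lacks the attribute or has an empty value. The demo objects are stand-ins, not real Chunkr objects.

```python
from types import SimpleNamespace


def collect_descriptions(chunks):
    """Return (ocr_text, description) pairs for segments that carry a description."""
    pairs = []
    for chunk in chunks:
        for segment in chunk.segments:
            description = getattr(segment, "description", None)
            if description:
                pairs.append((segment.text, description))
    return pairs


# Stand-in data shaped like task.output.chunks (hypothetical, for illustration only).
demo_chunks = [
    SimpleNamespace(segments=[
        SimpleNamespace(text="Q1 Q2 10 12", description="Revenue by quarter."),
        SimpleNamespace(text="Plain paragraph.", description=None),
    ]),
]

pairs = collect_descriptions(demo_chunks)
```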

Extended Context

The extended_context flag controls whether Chunkr provides full-page context when processing a segment. When set to True, Chunkr passes the entire page image to the LLM as context for that segment. Extended context is off by default for all segment types; to use it, set extended_context=True in the GenerationConfig for each segment type that needs it. Extended context is particularly beneficial for:
  • Tables/Charts with External Legends: When a legend or explanatory text is located elsewhere on the page but is crucial for interpreting the table/chart.
  • Images Requiring Surrounding Context: For images where understanding the surrounding text or other visual elements is necessary for accurate description or analysis.
  • Formulas/Diagrams: When the meaning depends on adjacent text or figures.
from chunkr_ai.models import Configuration, GenerationConfig, GenerationStrategy, SegmentProcessing, SegmentFormat

config = Configuration(
    segment_processing=SegmentProcessing(
        Table=GenerationConfig(
            format=SegmentFormat.MARKDOWN,
            strategy=GenerationStrategy.LLM,
            extended_context=True,  # Enable extended context
        ),
        Picture=GenerationConfig(
            format=SegmentFormat.HTML,
            strategy=GenerationStrategy.LLM,
            extended_context=True,  # Enable extended context
        )
        # ... other segment configs
    )
)

Embed Field Calculation

The embed field in chunks is automatically calculated based on the segment processing configuration. The content of the embed field includes:
  • The generated content from each segment
  • The description field (if description=True is set for that segment type)
for chunk in task.output.chunks:
    print(chunk.embed)  # Contains concatenated content and descriptions from all segments in the chunk
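
To make the calculation concrete, the sketch below rebuilds an approximation of the embed text from the segments of one chunk: each segment contributes its generated content, followed by its description when one exists. The helper name approximate_embed and the newline separator are assumptions; the exact separator and ordering Chunkr uses are not specified here, so treat chunk.embed as the source of truth.

```python
from types import SimpleNamespace


def approximate_embed(chunk, separator="\n"):
    """Concatenate content and, where present, descriptions of all segments in a chunk."""
    parts = []
    for segment in chunk.segments:
        if segment.content:
            parts.append(segment.content)
        description = getattr(segment, "description", None)
        if description:  # only present when description=True for that segment type
            parts.append(description)
    return separator.join(parts)


# Stand-in chunk (hypothetical data, not a real Chunkr object).
demo_chunk = SimpleNamespace(segments=[
    SimpleNamespace(content="<table><tr><td>10</td></tr></table>", description="A one-cell table."),
    SimpleNamespace(content="Plain paragraph.", description=None),
])

embed_text = approximate_embed(demo_chunk)
```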

Example

Here is a quick example of how to use Chunkr to process a document with different segment processing configurations. This configuration will:
  • Generate descriptive summaries for all Table segments and populate the description field with AI-generated content
  • Enable extended context for Tables and Pictures to capture visual context from the full page
  • Crop all SectionHeader segments to the bounding box
  • Leave all other segment types on their default processing
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    Configuration,
    CroppingStrategy,
    GenerationConfig,
    GenerationStrategy,
    SegmentProcessing,
    SegmentFormat
)

chunkr = Chunkr()

task = chunkr.upload("path/to/file", Configuration(
    segment_processing=SegmentProcessing(
        Table=GenerationConfig(
            description=True,  # Generate descriptions for Table segments
            extended_context=True,
        ),
        Picture=GenerationConfig(
            extended_context=True,
        ),
        SectionHeader=GenerationConfig(
            crop_image=CroppingStrategy.ALL
        ),
    ),
))