After Chunkr extracts segments from a document, it post-processes each one. You can keep the defaults or configure how each segment type is processed.

Processing Methods

  • Vision Language Models (VLM): Leverage AI models to generate HTML/Markdown content and run custom prompts
  • Heuristic-based Processing: Apply rule-based algorithms for consistent HTML/Markdown generation

Additional Features

  • Cropping: Return an image of each segment, cropped to its bounding box

These options let you build pipelines tailored to your documents. The default processing works well for most documents and RAG use cases.

Defaults

By default, Chunkr applies the following processing strategies for each segment type. You can override these defaults by specifying custom configuration in your SegmentProcessing settings. HTML and Markdown are always returned.

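from chunkr_ai.models import (
    Configuration,
    CroppingStrategy,
    GenerationConfig,
    GenerationStrategy,
    SegmentProcessing,
)

# By default, Table and Formula segments are processed with an LLM.
# Formulas are returned as LaTeX.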

default_llm_config = GenerationConfig(
    html=GenerationStrategy.LLM,
    markdown=GenerationStrategy.LLM,
    crop_image=CroppingStrategy.AUTO
)

default_config = Configuration(
    segment_processing=SegmentProcessing(
        Table=default_llm_config,
        Formula=default_llm_config,
    )
)
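
One way to picture how overrides interact with these defaults is a per-type lookup with a fallback. The sketch below is not Chunkr's implementation: `Strategy`, `SegmentConfig`, `DEFAULTS`, and `resolve` are hypothetical stand-ins built from plain dataclasses. It assumes that segment types other than Table and Formula fall back to heuristics, and that an override replaces only the fields it names.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Strategy(Enum):
    """Hypothetical stand-in for GenerationStrategy."""
    LLM = "LLM"
    AUTO = "Auto"


@dataclass(frozen=True)
class SegmentConfig:
    """Hypothetical stand-in for GenerationConfig (html/markdown only)."""
    html: Strategy = Strategy.AUTO
    markdown: Strategy = Strategy.AUTO


# Defaults described above: Table and Formula use the LLM.
# Assumption: every other segment type uses heuristics (AUTO).
DEFAULTS = {
    "Table": SegmentConfig(Strategy.LLM, Strategy.LLM),
    "Formula": SegmentConfig(Strategy.LLM, Strategy.LLM),
}


def resolve(segment_type, overrides):
    """Return the effective config for one segment type.

    Assumption: an override changes only the fields it names and
    leaves the remaining fields at their default values.
    """
    base = DEFAULTS.get(segment_type, SegmentConfig())
    return replace(base, **overrides.get(segment_type, {}))


# Override only the HTML strategy for tables; Markdown keeps its default.
table_config = resolve("Table", {"Table": {"html": Strategy.AUTO}})
text_config = resolve("Text", {})
```

Under these assumptions, `table_config` generates HTML with heuristics while Markdown still uses the LLM, and `text_config` uses heuristics for both.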

Example

Here is a quick example of how to use Chunkr to process a document with different segment processing configurations. This configuration will:

  • Summarize the key trends of all Table segments
  • Crop all SectionHeader segments to the bounding box
  • Generate HTML using heuristics and Markdown using a VLM for all Text segments

from chunkr_ai import Chunkr
from chunkr_ai.models import (
    Configuration, 
    CroppingStrategy, 
    GenerationConfig, 
    GenerationStrategy, 
    SegmentProcessing
)

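chunkr = Chunkr()

# Upload the file with a per-segment-type processing configuration
# and keep the returned task so the results can be inspected.
task = chunkr.upload("path/to/file", Configuration(
    segment_processing=SegmentProcessing(
        # Run a custom prompt on every Table segment
        Table=GenerationConfig(
            llm="Summarize the key trends in this table"
        ),
        # Return a cropped image for every SectionHeader segment
        SectionHeader=GenerationConfig(
            crop_image=CroppingStrategy.ALL
        ),
        # Heuristics (AUTO) for HTML, VLM for Markdown on Text segments
        Text=GenerationConfig(
            html=GenerationStrategy.AUTO,
            markdown=GenerationStrategy.LLM
        ),
    ),
))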
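
Once processing finishes, it can help to group the returned segments by type and check what each configuration produced. The sketch below works on a plain dictionary rather than on the SDK's response object. It assumes the output holds a `chunks` list whose entries each contain a `segments` list, and that every segment carries `segment_type`, `html`, `markdown`, and `llm` fields. These field names, the `segments_by_type` helper, and the sample data are illustrative assumptions, so verify them against the actual response.

```python
from collections import defaultdict


def segments_by_type(output):
    """Group segments from a task output dictionary by segment type.

    Assumption: `output` looks like
    {"chunks": [{"segments": [{"segment_type": ..., ...}]}]}.
    """
    grouped = defaultdict(list)
    for chunk in output.get("chunks", []):
        for segment in chunk.get("segments", []):
            grouped[segment.get("segment_type", "Unknown")].append(segment)
    return dict(grouped)


# Made-up sample output, only to exercise the helper.
sample_output = {
    "chunks": [
        {
            "segments": [
                {"segment_type": "SectionHeader", "markdown": "# Results"},
                {"segment_type": "Table", "llm": "Revenue rises each quarter."},
            ]
        },
        {
            "segments": [
                {"segment_type": "Text", "html": "<p>Summary</p>"},
                {"segment_type": "Table", "llm": "Costs stay flat."},
            ]
        },
    ]
}

grouped = segments_by_type(sample_output)

# Collect the custom-prompt results for every Table segment.
table_summaries = [segment.get("llm") for segment in grouped.get("Table", [])]
```

Grouping by type keeps the inspection code independent of how segments are distributed across chunks.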