Post-processing of segments
Note: Chunkr currently does not support creating embeddings. The embed field contains the concatenated content and descriptions (if enabled) from all segments in the chunk.
With the SegmentProcessing settings, you configure how each segment type is processed.
Whenever a segment of a given type is identified, its configuration is applied.
These are all the fields that are available for configuration:
Defaults for the SegmentProcessing settings:
Generated content and OCR text are always returned. Extended context is off by default for all outputs.
The SegmentFormat enum determines the output format for the generated content. It has two options:
SegmentFormat.HTML: Generate HTML content for the segment
SegmentFormat.MARKDOWN: Generate Markdown content for the segment
The GenerationStrategy enum determines how Chunkr processes and generates output for a segment. It has three options:
GenerationStrategy.LLM: Uses a Vision Language Model (VLM) to generate the segment content with segment-specific prompts. This is particularly useful for complex segments like tables, charts, and images where you want AI-powered understanding.
GenerationStrategy.AUTO: Uses rule-based heuristics and traditional OCR to generate the segment content. This is faster and works well for straightforward content like plain text, headers, and lists.
GenerationStrategy.IGNORE: Excludes the segment from the final output entirely. This is useful when you want to filter out certain types of content.
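The three strategies can be pictured as a small dispatch on the strategy and format values. The sketch below is not the Chunkr SDK: the enums are local stand-ins with arbitrary values, render_segment is a hypothetical helper, and fake_llm is a placeholder for whatever model call the LLM strategy would make.

```python
from enum import Enum
from typing import Callable, Optional


class SegmentFormat(Enum):
    """Local stand-in for the SegmentFormat enum (values are arbitrary)."""
    HTML = "html"
    MARKDOWN = "markdown"


class GenerationStrategy(Enum):
    """Local stand-in for the GenerationStrategy enum (values are arbitrary)."""
    LLM = "llm"
    AUTO = "auto"
    IGNORE = "ignore"


def render_segment(
    ocr_text: str,
    fmt: SegmentFormat,
    strategy: GenerationStrategy,
    llm: Callable[[str, SegmentFormat], str],
) -> Optional[str]:
    """Hypothetical helper: return generated content, or None if ignored."""
    if strategy is GenerationStrategy.IGNORE:
        # The segment is excluded from the final output.
        return None
    if strategy is GenerationStrategy.LLM:
        # Delegate to the (assumed) model call.
        return llm(ocr_text, fmt)
    # AUTO: cheap rule-based formatting of the OCR text.
    if fmt is SegmentFormat.HTML:
        return f"<p>{ocr_text}</p>"
    return ocr_text


def fake_llm(text: str, fmt: SegmentFormat) -> str:
    """Placeholder for a VLM call; just tags the text with the format name."""
    return f"[{fmt.name}] {text}"
```

In this toy model, only the LLM branch ever calls the model, which mirrors the point above that AUTO is the faster path for straightforward content.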
The generated content is controlled by the format (HTML or Markdown) and strategy (LLM, AUTO, or IGNORE) fields in the configuration.
This is how you can access the generated content and OCR text in the segment object:
The CroppingStrategy enum controls how Chunkr handles image cropping for segments. It offers two options:
CroppingStrategy.ALL: Forces cropping for every segment, extracting just the content within its bounding box.
CroppingStrategy.AUTO: Lets Chunkr decide when cropping is necessary based on the segment type and post-processing requirements. For example, if an LLM is required to generate content from tables, then they will be cropped.
Note: By default, the image field contains a presigned URL to the cropped image that is valid for 10 minutes.
You can also retrieve the image data as a base64-encoded string by following our best practices guide.
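Because the presigned URL expires, it can be convenient to persist the base64 form locally. The following is a minimal sketch that assumes you already hold the base64 string (how to request it is covered in the best practices guide, not here); save_base64_image is a hypothetical helper built only on the Python standard library.

```python
import base64
from pathlib import Path


def save_base64_image(b64_data: str, out_path: str) -> int:
    """Decode a base64 string, write the bytes to out_path, return the byte count."""
    # Assumption: the string may carry a data-URL header such as
    # "data:image/png;base64,"; drop it if present.
    if b64_data.startswith("data:") and "," in b64_data:
        b64_data = b64_data.split(",", 1)[1]
    raw = base64.b64decode(b64_data)
    Path(out_path).write_bytes(raw)
    return len(raw)
```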
The description field is a boolean that controls whether Chunkr generates a descriptive summary for segments using an LLM. When set to True, the segment will include an additional description field containing AI-generated content that describes the segment.
The extended_context flag controls whether Chunkr provides the full page context when processing a segment. When set to True, Chunkr will use the entire page image as context when processing the segment with an LLM.
Extended context is OFF by default for all segment types. To leverage extended context, you must explicitly set extended_context=True within the GenerationConfig for the desired segment type(s).
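The opt-in behaviour can be modelled as per-segment-type overrides on top of defaults. This sketch uses a local stand-in dataclass named after GenerationConfig, not the SDK class itself. The overrides mapping and config_for helper are hypothetical, and the default shown for description is an assumption; only the extended_context default is stated in the text.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class GenerationConfig:
    """Local stand-in holding only the two boolean fields discussed here."""
    description: bool = False       # default value is an assumption
    extended_context: bool = False  # OFF by default, per the text


# Per-segment-type overrides; any type not listed keeps the defaults.
overrides: Dict[str, GenerationConfig] = {
    "Table": GenerationConfig(description=True, extended_context=True),
}


def config_for(segment_type: str) -> GenerationConfig:
    """Return the override for a segment type, or the defaults if none is set."""
    return overrides.get(segment_type, GenerationConfig())
```

Here only Table segments would receive the full page as context; every other type, such as SectionHeader, stays on the default.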
Extended context is particularly beneficial for:
The embed field in chunks is automatically calculated based on the segment processing configuration. The content of the embed field includes:
The content from each segment
The description field (if description=True is set for that segment type)
For example, a configuration can:
Generate descriptions for Table segments and populate the description field with AI-generated content
Crop SectionHeader segments to the bounding box
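The composition of the embed field can be sketched as a simple concatenation. This is an illustration under stated assumptions, not Chunkr's implementation: Segment is a local stand-in, build_embed is a hypothetical helper, and both the newline separator and the content-then-description ordering are guesses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """Local stand-in with only the fields relevant to embed."""
    content: str
    description: Optional[str] = None  # present only when description=True


def build_embed(segments: List[Segment], sep: str = "\n") -> str:
    """Concatenate content and any descriptions from all segments in a chunk."""
    parts: List[str] = []
    for seg in segments:
        parts.append(seg.content)
        if seg.description:
            # Descriptions are included only where they were generated.
            parts.append(seg.description)
    return sep.join(parts)
```

Since Chunkr does not create embeddings itself, a string like the one returned here is what you would pass to your own embedding model.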