
Key Features
- Perfect Markdown & HTML: LLM-ready content (Markdown, HTML, tables, etc.).
- Reading order intact: Maintains the natural reading flow for complex layouts.
- Granular bounding boxes: Pinpoints element coordinates with precision for easy citations.
- Native spreadsheet handling: 100% reconstruction with formulas, styling, and cell values preserved; precise ranges; tables cleaned and charts converted to structured data.
- Post-processing: Token-aware chunking, cropped images, and more.
Example: Parse and access chunk content
Here’s how you can parse a document and access its chunks using our SDKs.
Advanced Configuration
For a comprehensive breakdown of every available configuration, please refer to our API Reference. Here is an overview of our configuration options:
- Pipeline (pipeline): Choose the provider (Azure or Chunkr) for layout analysis and OCR models.
- Layout Analysis & OCR:
  - Segmentation Strategy (segmentation_strategy): Choose either LayoutAnalysis (default) or a full-page VLM approach for parsing.
  - OCR Strategy (ocr_strategy): Use Auto to selectively apply OCR or All to force it on every page.
- Segment-level Customization (segment_processing): Control processing for each document element (e.g., Text, Table, Picture):
  - Processing Strategy (strategy): For each segment, set the strategy used to generate HTML/Markdown: Auto (simple OCR + logic), LLM (VLM generation), or Ignore (remove from output).
  - Format Control (format): Control the output format (Markdown or HTML) for segment content.
  - Extended Context (extended_context): Provide the full page image as additional context for VLM processing of a segment. Useful for cases like distant legends for tables and pictures.
  - Cropped Images (crop_image): Control whether a cropped image of the segment is included.
- Chunking (chunk_processing): Configure chunking strategy, sizes, and token-counting model.
- Error Handling (error_handling): Set to Fail (default) to stop on any error, or Continue to process despite non-critical errors.
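To make the options above more concrete, here is a minimal Python sketch that assembles a configuration as a plain dictionary and checks it against the values named in this overview. It does not call the SDK. The option names and values come from the list above, but the nesting (for example, keying segment_processing by segment type), the boolean value for extended_context, and the validate_config helper are assumptions made for illustration; the API Reference remains the authority on the actual schema.

```python
# Values named in the overview above. How they nest in a real request is an
# assumption for illustration; check the API Reference for the actual schema.
ALLOWED = {
    "pipeline": {"Azure", "Chunkr"},
    "ocr_strategy": {"Auto", "All"},
    "error_handling": {"Fail", "Continue"},
}
SEGMENT_ALLOWED = {
    "strategy": {"Auto", "LLM", "Ignore"},
    "format": {"Markdown", "HTML"},
}


def validate_config(config: dict) -> list[str]:
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    # Top-level options with a known, closed set of values.
    for key, allowed in ALLOWED.items():
        if key in config and config[key] not in allowed:
            problems.append(
                f"{key}: {config[key]!r} is not one of {sorted(allowed)}"
            )
    # Per-segment options, assumed to be keyed by segment type.
    for segment_type, options in config.get("segment_processing", {}).items():
        for key, allowed in SEGMENT_ALLOWED.items():
            if key in options and options[key] not in allowed:
                problems.append(
                    f"segment_processing.{segment_type}.{key}: "
                    f"{options[key]!r} is not one of {sorted(allowed)}"
                )
    return problems


config = {
    "pipeline": "Chunkr",
    "segmentation_strategy": "LayoutAnalysis",
    "ocr_strategy": "Auto",
    "segment_processing": {
        # Tables: VLM-generated HTML, with the full page as extra context
        # (boolean value assumed).
        "Table": {"strategy": "LLM", "format": "HTML", "extended_context": True},
        # Pictures: dropped from the output.
        "Picture": {"strategy": "Ignore"},
    },
    "error_handling": "Continue",
}

for problem in validate_config(config):
    print(problem)
```

Checking values locally like this catches typos such as "Always" instead of "All" before a document is submitted. With error_handling set to Continue, processing carries on despite non-critical errors instead of stopping at the first one.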