Parse Overview

The Parse feature transforms complex documents into machine-readable data, optimized for LLMs.

It identifies document elements, processes each one according to its type, and outputs clean HTML and Markdown content ready for AI applications and downstream workflow automation.

Key Features

Perfect Markdown & HTML: LLM-ready content (Markdown, HTML, tables, etc.).
Reading order intact: Maintains the natural reading flow for complex layouts.
Granular bounding boxes: Pinpoints element coordinates with precision for easy citations.
Native spreadsheet handling: 100% reconstruction with formulas, styling, and cell values preserved, plus precise ranges, cleaned tables, and charts converted to structured data.
Post-processing: Token-aware chunking, cropped images, and more.

Example: Parse and access chunk content

Here’s how you can parse a document and access its chunks using the Python SDK.

import os
import time

from chunkr_ai import Chunkr

client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])

# Parse a document from URL
url = "https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/doc.pdf"
task = client.tasks.parse.create(file=url)

# OR upload a local file first, then parse it by its URL (use one of the two options)
with open("path/to/doc.pdf", "rb") as f:
    file = client.files.create(file=f)
    task = client.tasks.parse.create(file=file.url)

print(f"Task created with ID: {task.task_id}")

# Wait for the task to complete, polling every 3 seconds
while True:
    task = client.tasks.parse.get(task_id=task.task_id)
    if task.completed:
        break
    print(f"Task {task.task_id} is {task.status}")
    time.sleep(3)

# Access the chunks from the output
if task.status == "Succeeded" and task.output is not None:
    for chunk in task.output.chunks:
        print(chunk.content)
else:
    print(f"Task failed with status: {task.status}")
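
The polling loop above runs until the task completes, with no upper bound on how long it waits. As a minimal sketch of one way to bound it, the helper below wraps the same fetch-and-check pattern with a timeout. The name wait_until and its parameters are hypothetical and are not part of the Chunkr SDK; it only assumes you can supply a function that fetches the latest state and a function that decides whether that state is final.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(
    fetch: Callable[[], T],
    is_done: Callable[[T], bool],
    interval: float = 3.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call fetch() repeatedly until is_done(result) is true or the timeout passes.

    Hypothetical helper for illustration; not provided by the SDK.
    """
    deadline = clock() + timeout
    while True:
        result = fetch()
        if is_done(result):
            return result
        if clock() >= deadline:
            raise TimeoutError(f"Gave up waiting after {timeout} seconds")
        sleep(interval)
```

With the client from the example above, this could be used as task = wait_until(lambda: client.tasks.parse.get(task_id=task.task_id), lambda t: t.completed), which relies only on the get call and the completed attribute already shown.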

Our default configuration is optimized through extensive testing and provides excellent results for most documents. If you have specific requirements, you can customize Parse with the options described below.

Advanced Configuration

For a comprehensive breakdown of every available configuration option, please refer to our API Reference. Here is an overview of the options:

Pipeline (pipeline): Choose the provider (Azure or Chunkr) for layout analysis and OCR models.
Layout Analysis & OCR:
- Segmentation Strategy (segmentation_strategy): Choose between LayoutAnalysis (default) or a full-page VLM approach for parsing.
- OCR Strategy (ocr_strategy): Use Auto to selectively apply OCR or All to force it on every page.
Segment-level Customization (segment_processing): Control processing for each document element (e.g., Text, Table, Picture):
- Processing Strategy (strategy): For each segment type, set the strategy used to generate HTML/Markdown: Auto (simple OCR + logic), LLM (VLM generation), or Ignore (remove from output).
- Format Control (format): Control the output format (Markdown or HTML) for segment content.
- Extended Context (extended_context): Provide the full page image as additional context for VLM processing of a segment. Useful for cases like distant legends for tables and pictures.
- Cropped Images (crop_image): Control whether a cropped image of the segment is included.
Chunking (chunk_processing): Configure chunking strategy, sizes, and token-counting model.
Error Handling (error_handling): Set to Fail (default) to stop on any error, or Continue to process despite non-critical errors.
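
To make the options above concrete, here is a rough sketch that collects them into a plain Python dictionary and checks the values against the choices named in this overview. It is not an SDK call: the nesting of the dictionary, the boolean type for extended_context, and the helper find_config_problems are all assumptions made for illustration. Consult the API Reference for the actual schema and for how configuration is passed to a parse request.

```python
from typing import Any, Dict, List

# Choices named in the overview above. How the real API nests these
# fields is an assumption made for this sketch.
TOP_LEVEL_CHOICES = {
    "pipeline": {"Azure", "Chunkr"},
    "ocr_strategy": {"Auto", "All"},
    "error_handling": {"Fail", "Continue"},
}
SEGMENT_CHOICES = {
    "strategy": {"Auto", "LLM", "Ignore"},
    "format": {"Markdown", "HTML"},
}


def find_config_problems(config: Dict[str, Any]) -> List[str]:
    """Return a message for each value that is not a choice listed in the overview."""
    problems: List[str] = []
    for key, allowed in TOP_LEVEL_CHOICES.items():
        if key in config and config[key] not in allowed:
            problems.append(f"{key}={config[key]!r} is not one of {sorted(allowed)}")
    for segment_type, options in config.get("segment_processing", {}).items():
        for key, allowed in SEGMENT_CHOICES.items():
            if key in options and options[key] not in allowed:
                problems.append(
                    f"segment_processing.{segment_type}.{key}={options[key]!r} "
                    f"is not one of {sorted(allowed)}"
                )
    return problems


# Hypothetical configuration: VLM-generated HTML for tables with full-page
# context, pictures dropped from the output, and non-critical errors tolerated.
example_config = {
    "pipeline": "Chunkr",
    "ocr_strategy": "Auto",
    "segment_processing": {
        "Table": {"strategy": "LLM", "format": "HTML", "extended_context": True},
        "Picture": {"strategy": "Ignore"},
    },
    "error_handling": "Continue",
}

print(find_config_problems(example_config))
```

Checking values locally like this catches typos such as a misspelled strategy name before a request is sent, but it only reflects the choices listed on this page and may lag behind the API.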
