# Health Check
Source: https://docs.chunkr.ai/api-references/health/health-check
get /health
Confirmation that the service can respond to requests
# Cancel Task
Source: https://docs.chunkr.ai/api-references/task/cancel-task
get /api/v1/task/{task_id}/cancel
Cancel a task that hasn't started processing yet:
- For new tasks: Status will be updated to `Cancelled`
- For updating tasks: Task will revert to the previous state
Requirements:
- Task must have status `Starting`
# Create Task
Source: https://docs.chunkr.ai/api-references/task/create-task
post /api/v1/task/parse
Queues a document for processing and returns a TaskResponse containing:
- Task ID for status polling
- Initial configuration
- File metadata
- Processing status
- Creation timestamp
- Presigned URLs for file access
The returned task will typically be in a `Starting` or `Processing` state.
Use the `GET /task/{task_id}` endpoint to poll for completion.
# Delete Task
Source: https://docs.chunkr.ai/api-references/task/delete-task
delete /api/v1/task/{task_id}
Delete a task by its ID.
Requirements:
- Task must have status `Succeeded` or `Failed`
# Get Task
Source: https://docs.chunkr.ai/api-references/task/get-task
get /api/v1/task/{task_id}
Retrieves detailed information about a task by its ID, including:
- Processing status
- Task configuration
- Output data (if processing is complete)
- File metadata (name, page count)
- Timestamps (created, started, finished)
- Presigned URLs for accessing files
This endpoint can be used to:
1. Poll the task status during processing
2. Retrieve the final output once processing is complete
3. Access task metadata and configuration
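Polling this endpoint is a simple loop: fetch the task, stop once the status is terminal. Here is a minimal sketch, where `fetch_status` is a hypothetical stand-in for your HTTP call to `GET /api/v1/task/{task_id}` (the SDK's `upload` method handles this wait for you):

```python
import time

# Terminal statuses referenced throughout these docs.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Cancelled"}

def poll_until_done(fetch_status, interval_seconds=1.0, max_attempts=60):
    """Repeatedly call fetch_status() until the task reaches a terminal status."""
    for _ in range(max_attempts):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(interval_seconds)
    raise TimeoutError("task did not finish within the polling window")
```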
# Update Task
Source: https://docs.chunkr.ai/api-references/task/update-task
patch /api/v1/task/{task_id}/parse
Updates an existing task's configuration and reprocesses the document.
The original configuration will be used for all values that are not provided in the update.
Requirements:
- Task must have status `Succeeded` or `Failed`
- New configuration must be different from the current one
The returned task will typically be in a `Starting` or `Processing` state.
Use the `GET /task/{task_id}` endpoint to poll for completion.
# Get Tasks
Source: https://docs.chunkr.ai/api-references/tasks/get-tasks
get /api/v1/tasks
Retrieves a paginated list of tasks.
Example usage:
`GET /api/v1/tasks?page=1&limit=10&include_chunks=false`
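To walk the full task list, you can page until an empty or short page comes back. A sketch under that assumption, where `fetch_page` is a hypothetical stand-in for the HTTP call above:

```python
def iter_all_tasks(fetch_page, limit=10):
    """Yield tasks across pages; stop on an empty or short page."""
    page = 1
    while True:
        batch = fetch_page(page=page, limit=limit)
        if not batch:
            return
        yield from batch
        if len(batch) < limit:
            return
        page += 1
```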
# Chunking
Source: https://docs.chunkr.ai/docs/features/chunking
Chunking
Chunking is the process of splitting a document into smaller segments.
These chunks can be used for semantic search and to improve LLM performance.
By leveraging layout analysis, we create intelligent chunks that preserve document structure and context. Our algorithm:
* Respects natural document boundaries (paragraphs, sections)
* Maintains semantic relationships between segments
* Optimizes chunk size for LLM processing
You can review the implementation of our chunking algorithm in our [GitHub repository](https://github.com/lumina-ai-inc/chunkr/blob/main/core/src/utils/services/chunking.rs#L113).
Here is an example that chunks the document into 512 words per chunk. These values are also the defaults, so you don't need to specify them.
```python Python
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    ChunkProcessing,
    Configuration,
    Tokenizer,
)

chunkr = Chunkr()
chunkr.upload("path/to/file", Configuration(
    chunk_processing=ChunkProcessing(
        ignore_headers_and_footers=True,
        target_length=512,
        tokenizer=Tokenizer.WORD
    ),
))
```
```bash cURL
curl --request POST \
  --url https://api.chunkr.ai/api/v1/task/parse \
  --header 'Authorization: YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "file": "base64_encoded_file_content",
    "file_name": "document.pdf",
    "chunk_processing": {
      "ignore_headers_and_footers": true,
      "target_length": 512,
      "tokenizer": {
        "Enum": "Word"
      }
    }
  }'
```
### Defaults
* `ignore_headers_and_footers`: True
* `target_length`: 512
* `tokenizer`: `Word`
## Tokenizer
Chunkr supports a large number of tokenizers. You can use our predefined ones or specify any tokenizer from Hugging Face.
### Predefined Tokenizers
The predefined tokenizers are enum values and can be used as follows:
```python Python
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    ChunkProcessing,
    Configuration,
    Tokenizer,
)

chunkr = Chunkr()
chunkr.upload("path/to/file", Configuration(
    chunk_processing=ChunkProcessing(
        tokenizer=Tokenizer.CL100K_BASE
    ),
))
```
```bash cURL
curl --request POST \
  --url https://api.chunkr.ai/api/v1/task/parse \
  --header 'Authorization: YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "file": "base64_encoded_file_content",
    "file_name": "document.pdf",
    "chunk_processing": {
      "tokenizer": {
        "Enum": "Cl100kBase"
      }
    }
  }'
```
Available options:
* `Word`: Split by words
* `Cl100kBase`: For OpenAI models (e.g. GPT-3.5, GPT-4, text-embedding-ada-002)
* `XlmRobertaBase`: For RoBERTa-based multilingual models
* `BertBaseUncased`: BERT base uncased tokenizer
You can also pass the tokenizer enum as a string in the Python SDK; the string is converted to the enum value automatically.
```python Python
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    ChunkProcessing,
    Configuration,
    Tokenizer,
)

chunkr = Chunkr()
chunkr.upload("path/to/file", Configuration(
    chunk_processing=ChunkProcessing(
        tokenizer="Word"
    ),
))
```
### Hugging Face Tokenizers
Use any Hugging Face tokenizer by providing its model ID as a string (e.g. "facebook/bart-large", "Qwen/Qwen-tokenizer", etc.)
```python Python
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    ChunkProcessing,
    Configuration,
    Tokenizer,
)

chunkr = Chunkr()
chunkr.upload("path/to/file", Configuration(
    chunk_processing=ChunkProcessing(
        tokenizer="Qwen/Qwen-tokenizer"
    ),
))
```
```bash cURL
curl --request POST \
  --url https://api.chunkr.ai/api/v1/task/parse \
  --header 'Authorization: YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "file": "base64_encoded_file_content",
    "file_name": "document.pdf",
    "chunk_processing": {
      "tokenizer": {
        "String": "Qwen/Qwen-tokenizer"
      }
    }
  }'
```
## Calculating Chunk Lengths With Embed Sources
When calculating chunk lengths and performing tokenization, we use the text from the `embed` field in each chunk object. This field contains the text that will be compared against the target length.
You can configure what text goes into the `embed` field by setting the `embed_sources` parameter in your segment processing configuration. This parameter is specified under `segment_processing.{segment_type}` in your configuration.
You can see more information about the `embed_sources` parameter in the [Segment Processing](/features/segment-processing) section.
Here's an example of customizing the `embed` field content for Picture segments. By configuring `embed_sources`, you can include both the LLM-generated output and Chunkr's markdown output in the `embed` field for Pictures, while other segment types will continue using just the default Markdown content.
Additionally, we can use the `CL100K_BASE` tokenizer so chunk lengths are counted in tokens compatible with OpenAI models.
This means for this configuration, when calculating chunk lengths:
* Picture segments: Length will be based on both the LLM summary and Markdown content
* All other segments: Length will be based only on the Markdown content
* The tokenizer will be `CL100K_BASE`
```python Python
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    ChunkProcessing,
    Configuration,
    EmbedSource,
    GenerationConfig,
    SegmentProcessing,
    Tokenizer,
)

chunkr = Chunkr()
chunkr.upload("path/to/file", Configuration(
    chunk_processing=ChunkProcessing(
        tokenizer=Tokenizer.CL100K_BASE
    ),
    segment_processing=SegmentProcessing(
        Picture=GenerationConfig(
            llm="Summarize the key information presented",
            embed_sources=[EmbedSource.MARKDOWN, EmbedSource.LLM]
        )
    )
))
```
```bash cURL
curl --request POST \
  --url https://api.chunkr.ai/api/v1/task/parse \
  --header 'Authorization: YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "file": "base64_encoded_file_content",
    "file_name": "document.pdf",
    "chunk_processing": {
      "tokenizer": {
        "Enum": "Cl100kBase"
      }
    },
    "segment_processing": {
      "Picture": {
        "llm": "Summarize the key information presented",
        "embed_sources": ["Markdown", "LLM"]
      }
    }
  }'
```
By combining the `embed_sources` parameter with the `tokenizer` parameter, you can customize the chunk lengths and tokenization for different segment types.
This allows you to have very powerful chunking configurations for your documents.
# Segmentation Strategy
Source: https://docs.chunkr.ai/docs/features/layout-analysis/segmentation_strategy
Controls the segmentation strategy
The Chunkr AI API allows you to specify a `segmentation_strategy` for each document. This strategy controls how the document is segmented.
We have two strategies:
* `LayoutAnalysis`: Run our state-of-the-art layout analysis model to identify the layout elements. This is the default strategy.
* `Page`: Each segment is a page.
This is how you can configure the segmentation strategy:
```python Python
from chunkr_ai import Chunkr
from chunkr_ai.models import Configuration, SegmentationStrategy

chunkr = Chunkr()
chunkr.upload("path/to/file", Configuration(
    segmentation_strategy=SegmentationStrategy.LAYOUT_ANALYSIS
))
```
```bash cURL
curl --request POST \
  --url https://api.chunkr.ai/api/v1/task/parse \
  --header 'Authorization: YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "file": "base64_encoded_file_content",
    "file_name": "document.pdf",
    "segmentation_strategy": "LayoutAnalysis"
  }'
```
## When to use each strategy
For most documents, we recommend using the `LayoutAnalysis` strategy. This will give you the best results.
Use `Page` for:
* Faster processing speed when you need quick results and layout isn't critical
* Documents with unusual layouts that confuse the layout analysis model
* If the layout is complex but not very information dense, `Page` + VLM can generate surprisingly good HTML and markdown (see [Segment Processing](/docs/features/segment-processing)).
# What is Layout Analysis?
Source: https://docs.chunkr.ai/docs/features/layout-analysis/what
Understand the importance of layout analysis in document processing
Layout analysis is a crucial step in document processing that involves analyzing and understanding the spatial arrangement of content within a document.
It helps identify and classify different regions of a document, such as `text`, `table`, `headers`, `footers`, and `pictures`.
In short, it tells us what is in the document and where it is.
## Why is Layout Analysis Important?
Layout analysis serves several key purposes:
* **Structure Recognition**: It helps identify the logical structure and reading order of a document
* **Data Extraction**: By identifying specific regions (like tables, headers, or paragraphs), we can use specialized extraction methods for each type, improving accuracy
* **Better Chunking**: Layout elements allow us to identify sections of the document and generate better chunks.
* **Citations**: It allows LLMs to cite the correct region of the document, which can then be highlighted for a better experience.
## Segment Types
Chunkr uses a two-way vision-grid transformer to identify the layout of the document.
We support the following segment types:
* **Caption**: Text describing figures, tables, or other visual elements
* **Footnote**: References or additional information at the bottom of pages
* **Formula**: Mathematical or scientific equations
* **List Item**: Individual items in bulleted or numbered lists
* **Page**: Entire page (`segmentation_strategy=Page`)
* **Page Footer**: Content that appears at the bottom of each page
* **Page Header**: Content that appears at the top of each page
* **Picture**: Images, diagrams, or other visual elements
* **Section Header**: Headers that divide the document into sections
* **Table**: Structured data arranged in rows and columns
* **Text**: Regular paragraph text
* **Title**: Main document title
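As an illustration of how these types show up in practice, you can tally the segment types in a parsed task's output. Plain dicts stand in here for the SDK's chunk and segment objects, following the `task.output.chunks` → `chunk.segments` → `segment.segment_type` shape used throughout these docs:

```python
from collections import Counter

def count_segment_types(chunks):
    """Tally segment types across all chunks of a task's output."""
    counts = Counter()
    for chunk in chunks:
        for segment in chunk["segments"]:
            counts[segment["segment_type"]] += 1
    return counts
```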
# Optical Character Recognition (OCR)
Source: https://docs.chunkr.ai/docs/features/ocr
Extract text from images
Optical Character Recognition (OCR) is a technology that converts different types of documents,
such as scanned paper documents, PDF files, or images, into editable and searchable data.
## OCR Strategy
Chunkr AI API always returns OCR results. You can configure the OCR strategy using the `ocr_strategy` parameter.
We have two strategies:
* `All` (Default): Processes all pages with our OCR model.
* `Auto`: Intelligently applies OCR only to pages with missing or low-quality text. When a text layer is present, the bounding boxes from that layer are used instead of running OCR.
```python Python
from chunkr_ai import Chunkr
from chunkr_ai.models import Configuration, OcrStrategy

chunkr = Chunkr()
chunkr.upload("path/to/file", Configuration(
    ocr_strategy=OcrStrategy.AUTO  # can also be OcrStrategy.ALL
))
```
```bash cURL
curl --request POST \
  --url https://api.chunkr.ai/api/v1/task/parse \
  --header 'Authorization: YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "file": "base64_encoded_file_content",
    "file_name": "document.pdf",
    "ocr_strategy": "Auto"
  }'
```
The `Auto` strategy provides the best balance between accuracy and performance for most use cases.
Use the `All` strategy when you need to ensure consistent text extraction across all pages or when you suspect the existing text layer might be unreliable.
## OCR + Layout Analysis
OCR and Layout Analysis together are a powerful combination.
It allows us to get word level bounding boxes and text while also understanding the layout of the document.
You can use that to make experiences like:
* Highlighting exact numbers in a table
* Highlighting text in images
* Embedding the text from pictures for semantic search
## Other common use cases
* Digitizing old books and documents
* Processing invoices and receipts
* Automating form data entry
* Reading license plates
* Converting handwritten notes to digital text
* Extracting text from screenshots and images
# Configuration
Source: https://docs.chunkr.ai/docs/features/overview
Configure the API to your needs
Different applications have different needs. Chunkr AI API is designed to be flexible and customizable to meet your specific requirements.
We support the following configuration options:
* `chunk_processing`: Controls the settings for chunking and the post-processing of each chunk.
* `expires_in`: The number of seconds until the task is deleted.
* `high_resolution`: Whether to use high-resolution images for cropping and post-processing.
* `ocr_strategy`: Controls the Optical Character Recognition (OCR) strategy.
* `pipeline`: Options for layout analysis and OCR providers.
* `segment_processing`: Controls the post-processing of each segment type. Allows you to generate HTML, markdown and run custom VLM prompts.
* `segmentation_strategy`: Controls the segmentation strategy.
The configuration options can be combined to create a customized processing pipeline. When a `Task` is created, the configuration is done through the `Configuration` object.
Here is an example of how to configure the API to run a custom VLM prompt on each picture in a document:
```python Python
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    Configuration,
    GenerationConfig,
    SegmentProcessing,
)

chunkr = Chunkr()
chunkr.upload("path/to/file", Configuration(
    segment_processing=SegmentProcessing(
        Picture=GenerationConfig(
            llm="Does this picture have a cat in it? Answer must be true or false."
        )
    ),
))
```
```bash cURL
curl --request POST \
  --url https://api.chunkr.ai/api/v1/task/parse \
  --header 'Authorization: YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "file": "base64_or_url_to_file",
    "file_name": "document.pdf",
    "segment_processing": {
      "Picture": {
        "llm": "Does this picture have a cat in it? Answer must be true or false."
      }
    }
  }'
```
# Pipeline
Source: https://docs.chunkr.ai/docs/features/pipeline
Choose providers to process your documents
In addition to using Chunkr's default models, we also provide a pipeline interface that lets you use Azure Document Intelligence as a provider.
When using Azure, instead of the default models, your files are processed through the Azure layout analysis model, the Azure OCR model, and the Azure table OCR model.
You can still leverage Chunkr's intelligent chunking and segment processing. The output will be mapped to the Chunkr output format.
## When to use Azure
* If our queue is full, you can use Azure to process your files
* If you don't need VLMs on your tables, you can use the Azure table OCR model. This will allow you to get much faster results.
* Better OCR results for some documents (we are working on closing the gap!)
We improve the outputs from Azure with a combination of last-mile engineering and LLMs.
In our testing, the hybrid approach (traditional layout analysis + OCR for simple elements and LLMs for complex elements) has the most accurate results.
## Example
1. Use default segment processing and chunking with the Chunkr layout analysis model and OCR model.
```python Python
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    Configuration,
    Pipeline,
)

chunkr = Chunkr()
chunkr.upload("path/to/file", Configuration(
    pipeline=Pipeline.CHUNKR
))
```
```bash cURL
curl --request POST \
  --url https://api.chunkr.ai/api/v1/task/parse \
  --header 'Authorization: YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "file": "base64_encoded_file_content",
    "file_name": "document.pdf",
    "pipeline": "Chunkr"
  }'
```
2. Use default chunking with the Azure layout analysis model, OCR model and table OCR model.
In this case, the HTML and Markdown for the `Table` segment will be generated by the Azure table OCR model.
```python Python
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    Configuration,
    GenerationConfig,
    GenerationStrategy,
    SegmentProcessing,
    Pipeline,
)

chunkr = Chunkr()
chunkr.upload("path/to/file", Configuration(
    segment_processing=SegmentProcessing(
        Table=GenerationConfig(
            html=GenerationStrategy.AUTO,
            markdown=GenerationStrategy.AUTO
        ),
    ),
    pipeline=Pipeline.AZURE,
))
```
```bash cURL
curl --request POST \
  --url https://api.chunkr.ai/api/v1/task/parse \
  --header 'Authorization: YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "file": "base64_encoded_file_content",
    "file_name": "document.pdf",
    "segment_processing": {
      "Table": {
        "html": "Auto",
        "markdown": "Auto"
      }
    },
    "pipeline": "Azure"
  }'
```
# Segment Processing
Source: https://docs.chunkr.ai/docs/features/segment-processing
Post-processing of segments
Chunkr processes files by converting them into chunks, where each chunk contains a list of segments. This basic unit allows our API to be very flexible. See more information in the [Layout Analysis](./layout-analysis/segmentation_strategy.mdx) section.
After the segments are identified you can easily configure many post-processing capabilities. You can use our defaults or configure how each segment type is processed.
#### Processing Methods
* **Vision Language Models (VLM)**: Leverage AI models to generate HTML/Markdown content and run custom prompts
* **Heuristic-based Processing**: Apply rule-based algorithms for consistent HTML/Markdown generation
#### Additional Features
* **Cropping**: Get back the cropped images
* **Content to embed**: Configure the content that will be used for chunking and embeddings
Our default processing works well for most documents and RAG use cases.
> **Note**: Chunkr currently does not create embeddings; the `embed_sources` field only populates the `embed` field of each `chunk`.
## Understanding the configuration
When you configure the `SegmentProcessing` settings, you are configuring how each segment type is processed.
This means that anytime a segment type is identified, the configuration will be applied.
These are all the fields that are available for configuration:
```python
GenerationConfig(
    html=GenerationStrategy.AUTO,
    markdown=GenerationStrategy.AUTO,
    crop_image=CroppingStrategy.AUTO,
    llm=None,
    embed_sources=[EmbedSource.MARKDOWN],
)
```
### Defaults
By default, Chunkr applies the following processing strategies for each segment type.
You can override these defaults by specifying custom configuration in your `SegmentProcessing` settings.
HTML, Markdown, and content are always returned.
```python Page, Tables and Formulas
# Page, Table and Formula segments are processed with an LLM by default.
# Formulas are returned as LaTeX.
default_llm_config = GenerationConfig(
    html=GenerationStrategy.LLM,
    markdown=GenerationStrategy.LLM,
    crop_image=CroppingStrategy.AUTO,
    llm=None,
    embed_sources=[EmbedSource.MARKDOWN]
)

default_config = Configuration(
    segment_processing=SegmentProcessing(
        Page=default_llm_config,
        Table=default_llm_config,
        Formula=default_llm_config,
    )
)
```
```python Pictures
# Pictures are processed with an LLM and cropped by default.
default_picture_config = GenerationConfig(
    html=GenerationStrategy.LLM,
    markdown=GenerationStrategy.LLM,
    crop_image=CroppingStrategy.ALL,
    llm=None,
    embed_sources=[EmbedSource.MARKDOWN]
)

default_config = Configuration(
    segment_processing=SegmentProcessing(
        Picture=default_picture_config
    )
)
```
```python Other Elements
# All other elements' HTML and Markdown are generated using heuristics.
default_text_config = GenerationConfig(
    html=GenerationStrategy.AUTO,
    markdown=GenerationStrategy.AUTO,
    crop_image=CroppingStrategy.AUTO,
    llm=None,
    embed_sources=[EmbedSource.MARKDOWN]
)

default_config = Configuration(
    segment_processing=SegmentProcessing(
        Title=default_text_config,
        SectionHeader=default_text_config,
        Text=default_text_config,
        ListItem=default_text_config,
        Caption=default_text_config,
        Footnote=default_text_config,
        PageHeader=default_text_config,
        PageFooter=default_text_config,
    )
)
```
### GenerationStrategy
The `GenerationStrategy` enum determines how Chunkr processes and generates output for a segment. It has two options:
* `GenerationStrategy.LLM`: Uses a Vision Language Model (VLM) to analyze and generate descriptions of the segment content. This is particularly useful for complex segments like tables, charts, and images where you want AI-powered understanding.
* `GenerationStrategy.AUTO`: Uses rule-based heuristics to process the segment. This is faster and works well for straightforward content like plain text, headers, and lists.
You can configure this strategy separately for HTML and Markdown output formats using the `html` and `markdown` fields in the configuration.
This is how you can access the `html` and `markdown` field in the segment object:
```python
for chunk in task.output.chunks:
    for segment in chunk.segments:
        print(segment.html)
        print(segment.markdown)
```
### CroppingStrategy
The `CroppingStrategy` enum controls how Chunkr handles image cropping for segments. It offers two options:
* `CroppingStrategy.ALL`: Forces cropping for every segment, extracting just the content within its bounding box.
* `CroppingStrategy.AUTO`: Lets Chunkr decide when cropping is necessary based on the segment type and post-processing requirements.
For example, if an LLM is required to generate HTML from a table, the table will be cropped.
This is how you can access the `image` field in the `segment` object:
```python
for chunk in task.output.chunks:
    for segment in chunk.segments:
        print(segment.image)
```
> **Note**: By default the `image` field contains a presigned URL to the cropped image that is valid for 10 minutes.
> You can also retrieve the image data as a base64 encoded string by following our [best practices guide](/sdk/data-operations/get#best-practices).
### LLM Prompt
The `llm` field is used to pass a prompt to the LLM. This prompt is independent of the `GenerationStrategy` and will be applied to all segment types that have the `llm` field set.
> **Note**: Custom `llm` prompts can sometimes trigger refusals from the model. If your tasks are failing, try adjusting the `llm` prompt.
### Embed Sources
The `embed_sources` field is used to specify the sources of content that will be used for embeddings.
This is useful if you want to use a different source of content for embeddings than the default HTML or Markdown.
They will also be used to calculate the chunk length during chunking. See more information in the [chunking](./chunking#calculating-chunk-lengths-with-embed-sources) section.
The `embed_sources` field is an array; the order of its elements determines which source appears first in the `embed` field.
For example, if you have `[EmbedSource.MARKDOWN, EmbedSource.HTML]`, the Markdown content will appear first in the `embed` field.
By default, the `embed` field will only contain the Markdown content.
This is how you can access the `embed` field in the `chunk` object:
```python
for chunk in task.output.chunks:
    print(chunk.embed)
```
> **Note**: This is the only configuration option that affects the `chunk` object rather than the `segment` object.
>
> When you set the `embed_sources` field:
>
> * You determine what content from segments will be included in the `embed` field of chunks
> * The order of sources in the array controls which content appears first in the `embed` field
> * This does not change the order of segments within chunks - reading order is always preserved
>
> For example, if you set `embed_sources=[EmbedSource.LLM, EmbedSource.MARKDOWN]` for Tables, the LLM-generated content will appear before the markdown content in the `embed` field of any chunk containing a Table segment.
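As a rough illustration of that ordering rule (not the actual implementation), assembling the `embed` text for a segment might look like the sketch below, where the dict keys are hypothetical stand-ins for the segment's generated content:

```python
def build_embed(segment, embed_sources):
    """Concatenate the chosen sources in the order given by embed_sources."""
    parts = {
        "Markdown": segment.get("markdown"),
        "HTML": segment.get("html"),
        "LLM": segment.get("llm"),
    }
    # Skip sources that have no content; order follows embed_sources.
    return "\n".join(parts[source] for source in embed_sources if parts.get(source))
```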
## Example
Here is a quick example of how to use Chunkr to process a document with different segment processing configurations.
This configuration will:
* Summarize the key trends of all `Table` segments and populate the `llm` field with the LLM content in the segment
* The `embed` field for chunks that contain a `Table` segment will include both the LLM content and the markdown for the table, with the LLM content appearing first.
* Crop all `SectionHeader` segments to the bounding box.
* All other segments will use their default processing.
```python Python
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    Configuration,
    CroppingStrategy,
    EmbedSource,
    GenerationConfig,
    SegmentProcessing,
)

chunkr = Chunkr()
chunkr.upload("path/to/file", Configuration(
    segment_processing=SegmentProcessing(
        Table=GenerationConfig(
            llm="Summarize the key trends in this table",
            embed_sources=[EmbedSource.LLM, EmbedSource.MARKDOWN]
        ),
        SectionHeader=GenerationConfig(
            crop_image=CroppingStrategy.ALL
        ),
    ),
))
```
```bash cURL
curl --request POST \
  --url https://api.chunkr.ai/api/v1/task/parse \
  --header 'Authorization: YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "file": "base64_encoded_file_content",
    "file_name": "document.pdf",
    "segment_processing": {
      "Table": {
        "llm": "Summarize the key trends in this table",
        "embed_sources": ["LLM", "Markdown"]
      },
      "SectionHeader": {
        "crop_image": "All"
      }
    }
  }'
```
# Changelog
Source: https://docs.chunkr.ai/docs/get-started/changelog
Please refer to our [GitHub Changelog](https://github.com/lumina-ai-inc/chunkr/blob/main/CHANGELOG.md) for the latest updates.
# LLM Documentation
Source: https://docs.chunkr.ai/docs/get-started/llm
LLM-ready documentation for Chunkr AI
## Available Formats
We offer two primary formats for LLMs:
* **Condensed Documentation**: [https://docs.chunkr.ai/llms.txt](https://docs.chunkr.ai/llms.txt)
Streamlined version optimized for quick reference by LLMs. These are also helpful for MCP servers.
* **Full Documentation**: [https://docs.chunkr.ai/llms-full.txt](https://docs.chunkr.ai/llms-full.txt)
Complete documentation with all details and examples. Can be dumped directly into context.
## How to Use
[Here](https://youtu.be/fk2WEVZfheI) is a helpful video on how to integrate llms.txt and MCP servers.
# Chunkr AI
Source: https://docs.chunkr.ai/docs/get-started/overview
Open Source Document Intelligence
## Features
* Preserve document structure with advanced layout detection
* Leverage Vision Language Models for enhanced document understanding
* Extract text from images and scanned documents with high accuracy
* Split documents into meaningful sections using layout-aware algorithms
* Options for layout analysis and OCR providers
* Process PDFs, Office files (Word, Excel, PowerPoint), and images through a single API
# Developer Quickstart
Source: https://docs.chunkr.ai/docs/get-started/quickstart
Learn how to get started with Chunkr AI API
Chunkr AI is an API service to convert complex documents into LLM/RAG-ready data. We support a wide range of document types, including PDFs, Office files (Word, Excel, PowerPoint), and images.
## Getting Started
To get started with Chunkr AI, follow these simple steps to set up your account and integrate our API into your application.
### Step 1: Sign Up and Create an API Key
1. Visit [Chunkr AI](https://chunkr.ai)
2. Click on "Login" and create your account
3. Once logged in, navigate to "API Keys" in the dashboard
### Step 2: Install our client SDK
```bash Python
pip install chunkr-ai
```
### Step 3: Upload your document
```python Python
from chunkr_ai import Chunkr
# Initialize the Chunkr client with your API key - get this from https://chunkr.ai
chunkr = Chunkr(api_key="your_api_key")
# Upload a document via url or local file path
url = "https://chunkr-web.s3.us-east-1.amazonaws.com/landing_page/input/specs.pdf"
task = chunkr.upload(url)
```
### Step 4: Export the results
Chunkr AI will return a `TaskResponse` object. This object contains the results of the document conversion. You can export the results in various formats or load them into a variable.
```python Python
# Export HTML of document
html = task.html(output_file="output.html")
# Export markdown of document
markdown = task.markdown(output_file="output.md")
# Export text of document
content = task.content(output_file="output.txt")
# Export result as JSON - TaskResponse is already in memory so no need to load it into a variable
task.json(output_file="output.json")
```
### Step 5: Explore the output
The output of the task can be used to build your RAG pipeline.
Check out the [API Reference](/api-references/task/create-task#response-output-chunks) for more details.
```python Python
# The output of the task is a list of chunks
chunks = task.output.chunks

# Each chunk is a list of segments
for chunk in chunks:
    for segment in chunk.segments:
        print(segment.segment_type)

# You can also access the `embed` field in the chunk
# for content to be used in RAG pipelines
for chunk in chunks:
    print(chunk.embed)
```
### Step 6: Clean up
You can clean up the open connections by calling the `close()` method on the `Chunkr` client.
```python Python
chunkr.close()
```
## Authentication Options
You can authenticate with the Chunkr AI API in two ways:
1. **Direct API Key** - Pass your API key directly when initializing the client
2. **Environment Variable** - Set `CHUNKR_API_KEY` in your `.env` file
```python Python
from chunkr_ai import Chunkr
# Option 1: Initialize with API key directly
chunkr = Chunkr(api_key="your_api_key")
# Option 2: Initialize without api_key parameter - will use CHUNKR_API_KEY from environment
chunkr = Chunkr()
```
## Self Hosted
If you're using a self-hosted deployment of Chunkr AI, you can configure the API URL when initializing the client:
```python Python
from chunkr_ai import Chunkr

# Option 1: With direct API key
chunkr = Chunkr(
    api_key="your_api_key",
    base_url="https://your-self-hosted-chunkr.com"
)

# Option 2: Using environment variables
# Set CHUNKR_API_KEY and CHUNKR_URL in your .env file
chunkr = Chunkr()
```
When using environment variables for self-hosted deployments, set both `CHUNKR_API_KEY` and `CHUNKR_URL` in your `.env` file.
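For example, a `.env` file for a self-hosted deployment might look like this (both values are placeholders):

```shell
CHUNKR_API_KEY=your_api_key
CHUNKR_URL=https://your-self-hosted-chunkr.com
```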
# Docker compose
Source: https://docs.chunkr.ai/docs/self-hosting/docker-compose
Please refer to our [GitHub README](https://github.com/lumina-ai-inc/chunkr?tab=readme-ov-file#quick-start-with-docker-compose) for instructions on how to get started with Chunkr AI using Docker Compose.
# Kubernetes
Source: https://docs.chunkr.ai/docs/self-hosting/kubernetes
Please refer to our [GitHub README](https://github.com/lumina-ai-inc/chunkr?tab=readme-ov-file#quick-start-with-kubernetes) for instructions on how to get started with Chunkr AI using Kubernetes.
# Bulk Upload
Source: https://docs.chunkr.ai/docs/use-cases/bulk-upload
Learn how to efficiently process multiple files with Chunkr AI
Here's how to efficiently process multiple files using Chunkr AI's async capabilities.
## Process a Directory
Here's a simple script to process all files in a directory:
```python Python
import asyncio
from chunkr_ai import Chunkr
import os
from pathlib import Path
chunkr = Chunkr()
async def process_directory(input_dir: str, output_dir: str):
try:
# Create output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)
# Get all files in directory
files = list(Path(input_dir).glob('*.*'))
print(f"Found {len(files)} files to process")
# Process files concurrently
tasks = []
for file_path in files:
task = asyncio.create_task(process_file(chunkr, file_path, output_dir))
tasks.append(task)
# Wait for all files to complete
results = await asyncio.gather(*tasks)
print(f"Completed processing {len(results)} files")
except Exception as e:
print(f"Error processing directory: {e}")
async def process_file(chunkr, file_path, output_dir):
try:
# Upload file
result = await chunkr.upload(file_path)
# Check if upload was successful
if result.status == "Failed":
print(f"Failed to process file {file_path}: {result.message}")
return None
# Save result
file_name = file_path.name
output_file_path = Path(output_dir) / f"{file_name}.json"
result.json(output_file_path)
return file_name
except Exception as e:
print(f"Error processing file {file_path}: {e}")
return None
# Run the processor
if __name__ == "__main__":
INPUT_DIR = "/data/Chunkr/dataset/files"
OUTPUT_DIR = "processed/"
asyncio.run(process_directory(INPUT_DIR, OUTPUT_DIR))
```
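The script above starts one task per file simultaneously. For large directories you may want to bound the number of in-flight uploads; here is a minimal sketch of that pattern with `asyncio.Semaphore` (the `gather_limited` helper and the stand-in `fake_work` coroutine are our own, not part of the SDK):

```python Python
import asyncio

async def gather_limited(coros, limit: int):
    # Run the given coroutines concurrently, but allow at most
    # `limit` of them to be in flight at any one time.
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))

# Stand-in for process_file(chunkr, file_path, output_dir)
async def fake_work(i):
    await asyncio.sleep(0)
    return i

results = asyncio.run(gather_limited([fake_work(i) for i in range(5)], limit=2))
print(results)  # [0, 1, 2, 3, 4] (gather preserves submission order)
```

In the bulk-upload script, you would pass `[process_file(chunkr, p, output_dir) for p in files]` to `gather_limited` instead of calling `asyncio.gather` directly.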
# Configuration
Source: https://docs.chunkr.ai/sdk/configuration
Learn how to configure tasks in Chunkr AI
Chunkr AI allows you to configure tasks with a `Configuration` object. All configurations can be used together.
```python Python
from chunkr_ai.models import ChunkProcessing, Configuration, OcrStrategy
config = Configuration(
chunk_processing=ChunkProcessing(target_length=1024),
expires_in=3600,
high_resolution=True,
ocr_strategy=OcrStrategy.AUTO,
)
task = chunkr.upload("path/to/your/file", config)
```
## Available Configuration Examples
### Chunk Processing
```python Python
from chunkr_ai.models import ChunkProcessing
config = Configuration(
chunk_processing=ChunkProcessing(
ignore_headers_and_footers=True,
target_length=1024
)
)
```
### Expires In
```python Python
config = Configuration(expires_in=3600)
```
### High Resolution
```python Python
config = Configuration(high_resolution=True)
```
### OCR Strategy
```python Python
config = Configuration(ocr_strategy=OcrStrategy.AUTO) # or OcrStrategy.ALL
```
### Segment Processing
This example showcases all the options for segment processing. It shows the default configuration, which is applied when nothing is specified.
For your own configuration, customize only the options you want to change; the defaults are applied for the rest.
```python Python
from chunkr_ai.models import (
Configuration,
CroppingStrategy,
GenerationConfig,
GenerationStrategy,
SegmentProcessing
)
config = Configuration(
segment_processing=SegmentProcessing(
Caption=GenerationConfig(
crop_image=CroppingStrategy.AUTO,
html=GenerationStrategy.AUTO,
markdown=GenerationStrategy.AUTO,
llm=None
),
Formula=GenerationConfig(
crop_image=CroppingStrategy.AUTO,
html=GenerationStrategy.LLM,
markdown=GenerationStrategy.LLM,
llm=None
),
Footnote=GenerationConfig(
crop_image=CroppingStrategy.AUTO,
html=GenerationStrategy.AUTO,
markdown=GenerationStrategy.AUTO,
llm=None
),
ListItem=GenerationConfig(
crop_image=CroppingStrategy.AUTO,
html=GenerationStrategy.AUTO,
markdown=GenerationStrategy.AUTO,
llm=None
),
Page=GenerationConfig(
crop_image=CroppingStrategy.AUTO,
html=GenerationStrategy.AUTO,
markdown=GenerationStrategy.AUTO,
llm=None
),
PageFooter=GenerationConfig(
crop_image=CroppingStrategy.AUTO,
html=GenerationStrategy.AUTO,
markdown=GenerationStrategy.AUTO,
llm=None
),
PageHeader=GenerationConfig(
crop_image=CroppingStrategy.AUTO,
html=GenerationStrategy.AUTO,
markdown=GenerationStrategy.AUTO,
llm=None
),
Picture=GenerationConfig(
crop_image=CroppingStrategy.ALL,
html=GenerationStrategy.AUTO,
markdown=GenerationStrategy.AUTO,
llm=None
),
SectionHeader=GenerationConfig(
crop_image=CroppingStrategy.AUTO,
html=GenerationStrategy.AUTO,
markdown=GenerationStrategy.AUTO,
llm=None
),
Table=GenerationConfig(
crop_image=CroppingStrategy.AUTO,
html=GenerationStrategy.LLM,
markdown=GenerationStrategy.LLM,
llm=None
),
Text=GenerationConfig(
crop_image=CroppingStrategy.AUTO,
html=GenerationStrategy.AUTO,
markdown=GenerationStrategy.AUTO,
llm=None
),
Title=GenerationConfig(
crop_image=CroppingStrategy.AUTO,
html=GenerationStrategy.AUTO,
markdown=GenerationStrategy.AUTO,
llm=None
)
)
)
```
You can customize any segment's generation strategy and add optional LLM prompts:
```python Python
# Example with custom LLM prompt for tables
config = Configuration(
segment_processing=SegmentProcessing(
Table=GenerationConfig(
crop_image=CroppingStrategy.AUTO,
html=GenerationStrategy.LLM,
markdown=GenerationStrategy.LLM,
llm="Convert this table to a clear and concise format"
)
)
)
```
### Segmentation Strategy
```python Python
from chunkr_ai.models import Configuration, SegmentationStrategy
config = Configuration(
    segmentation_strategy=SegmentationStrategy.LAYOUT_ANALYSIS  # or SegmentationStrategy.PAGE
)
```
# Canceling Tasks
Source: https://docs.chunkr.ai/sdk/data-operations/cancel
Learn how to cancel queued tasks in Chunkr AI
Chunkr AI allows you to cancel tasks that are queued but haven't started processing. Any task with status `Starting` can be canceled.
You can cancel tasks either by their ID or using a task object.
## Canceling by Task ID
Use the `cancel_task()` method when you have the task ID:
```python Python
from chunkr_ai import Chunkr
chunkr = Chunkr()
# Cancel task by ID
chunkr.cancel_task("task_123")
```
## Canceling from TaskResponse Object
If you have a task object, you can cancel it directly using the `cancel()` method. This method will also return the updated task status:
```python Python
# Get existing task
task = chunkr.get_task("task_123")
# Cancel the task and get updated status
updated_task = task.cancel()
print(updated_task.status) # Will be "Cancelled"
```
## Async Usage
For async applications, use `await`:
```python Python
# Cancel by ID
await chunkr.cancel_task("task_123")
# Or cancel from task object
task = await chunkr.get_task("task_123")
updated_task = await task.cancel()
```
# Creating Tasks
Source: https://docs.chunkr.ai/sdk/data-operations/create
Learn how to upload files and create processing tasks with Chunkr AI
The Chunkr AI SDK provides two main methods for uploading files:
* `upload()`: Upload and wait for complete processing
* `create_task()`: Upload and get an immediate task response
## Complete Processing with `upload()`
The `upload()` method handles the entire process - it uploads your file and waits for processing to complete:
```python Python
from chunkr_ai import Chunkr
chunkr = Chunkr()
# Upload and wait for complete processing
task = chunkr.upload("path/to/your/file")
# All processing is done - you can access results immediately
print(task.task_id)
print(task.status) # Will be "Succeeded"
print(task.output) # Contains processed results
```
## Instant Response with `create_task()`
If you want to start processing but don't want to wait for completion, use `create_task()`:
```python Python
# Create task without waiting
task = chunkr.create_task("path/to/your/file")
# Task is created but processing may not be complete
print(task.task_id)
print(task.status) # Might be "Starting"
print(task.output) # Might be None if processing isn't finished
```
## Checking Task Status with `poll()`
When using `create_task()`, you can check the status later using `poll()`:
```python Python
# Create task immediately
task = chunkr.create_task("path/to/your/file")
# ... do other work ...
# Check status when needed
result = task.poll()
print(result.status)
print(result.output) # Now contains processed results if status is "Succeeded"
```
For async applications, remember to use `await`:
```python Python
# Create task immediately
task = await chunkr.create_task("path/to/your/file")
# ... do other work ...
# Check status when needed
result = await task.poll()
```
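If you need a hard deadline on polling in async code, you can wrap `poll()` with `asyncio.wait_for`. A sketch, assuming `task` is a `TaskResponse` from `create_task()` (the `poll_with_timeout` helper and its default timeout are our own, not SDK API):

```python Python
import asyncio

async def poll_with_timeout(task, seconds: float = 300.0):
    # Wait for the task to finish, but give up after `seconds`
    # by raising asyncio.TimeoutError.
    return await asyncio.wait_for(task.poll(), timeout=seconds)
```

If the timeout fires, the task keeps processing server-side; you can simply call `poll()` again later.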
## Supported File Types
We support PDFs, Office files (Word, Excel, PowerPoint), and images. You can upload them in several ways:
```python Python
# From a file path
task = chunkr.upload("path/to/your/file")
# From an opened file
with open("path/to/your/file", "rb") as f:
task = chunkr.upload(f)
# From a URL
task = chunkr.upload("https://example.com/document.pdf")
# From a base64 string
task = chunkr.upload("JVBERi0...")
# From a PIL Image
from PIL import Image
img = Image.open("path/to/your/photo.jpg")
task = chunkr.upload(img)
```
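If you have raw bytes and need the base64 string form shown above, the standard library's `base64` module is enough (the `to_base64` helper is our own):

```python Python
import base64

def to_base64(data: bytes) -> str:
    # Encode raw file bytes as an ASCII base64 string.
    return base64.b64encode(data).decode("ascii")

# PDF files begin with "%PDF-1", which is why base64-encoded PDFs
# start with the "JVBERi0..." prefix used in the example above.
print(to_base64(b"%PDF-1.4")[:8])  # JVBERi0x
```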
# Deleting Tasks
Source: https://docs.chunkr.ai/sdk/data-operations/delete
Learn how to delete tasks in Chunkr AI
Chunkr AI provides methods to delete tasks when they're no longer needed. Any task that has status `Succeeded` or `Failed` can be deleted.
You can delete tasks either by their ID or using a task object.
## Deleting by Task ID
Use the `delete_task()` method when you have the task ID:
```python Python
from chunkr_ai import Chunkr
chunkr = Chunkr()
# Delete task by ID
chunkr.delete_task("task_123")
```
## Deleting from TaskResponse Object
If you have a task object, you can delete it directly using the `delete()` method:
```python Python
# Get existing task
task = chunkr.get_task("task_123")
# Delete the task
task.delete()
```
## Async Usage
For async applications, remember to use `await`:
```python Python
# Delete by ID
await chunkr.delete_task("task_123")
# Or delete from task object
task = await chunkr.get_task("task_123")
await task.delete()
```
# Getting Tasks
Source: https://docs.chunkr.ai/sdk/data-operations/get
Learn how to retrieve and read task information from Chunkr AI
You can retrieve information about a task at any time using the `get_task()` method.
This is useful for checking the status of previously created tasks or accessing their results.
## Basic Usage
```python Python
from chunkr_ai import Chunkr
chunkr = Chunkr()
# Get task by ID
task = chunkr.get_task("task_123")
# Access task information
print(task.status)
print(task.output)
```
## Customizing the Response
The `get_task()` method accepts two optional parameters to customize the response:
```python Python
# Exclude chunks from output
task = chunkr.get_task("task_123", include_chunks=False)
# Get task with base64-encoded URLs instead of presigned URLs
task = chunkr.get_task("task_123", base64_urls=True)
```
## Response Options
| Parameter | Default | Description |
| ---------------- | ------- | ---------------------------------------------------------------------------------------------------------------------------- |
| `include_chunks` | `True` | When `True`, includes all processed chunks in the response. Set to `False` to receive a lighter response without chunk data. |
| `base64_urls` | `False` | When `True`, returns URLs as base64-encoded strings. When `False`, returns presigned URLs for direct access. |
## Async Usage
For async applications, remember to use `await`:
```python Python
# Get task asynchronously
task = await chunkr.get_task("task_123")
```
## Best Practices
* Store task IDs when creating tasks if you need to retrieve them later
* Use `include_chunks=False` when you only need task metadata
* Consider using base64 URLs (`base64_urls=True`) when you need to cache or store the URLs locally
```python Python
# Get task with base64-encoded URLs
task = chunkr.get_task("task_123", base64_urls=True)
# Get task without chunks
task = chunkr.get_task("task_123", include_chunks=False)
```
```bash cURL
# Get task with base64-encoded URLs
curl -X GET "https://api.chunkr.ai/api/v1/task/{task_id}?base64_urls=true" -H "Authorization: YOUR_API_KEY"
# Get task without chunks
curl -X GET "https://api.chunkr.ai/api/v1/task/{task_id}?include_chunks=false" -H "Authorization: YOUR_API_KEY"
```
# Updating Tasks
Source: https://docs.chunkr.ai/sdk/data-operations/update
Learn how to update existing tasks in Chunkr AI
Chunkr AI allows you to update the configuration of existing tasks. You can update a task either by its ID or using a task object.
## Updating by Task ID
Use the `update()` method with a task ID when you have the ID stored:
```python Python
from chunkr_ai import Chunkr, Configuration
chunkr = Chunkr()
# Update task with new configuration
new_config = Configuration(
# your configuration options here
)
# Update and wait for processing
task = chunkr.update("task_123", new_config)
```
## Updating from TaskResponse Object
If you have a task object (from a previous `get_task()` or `create_task()`), you can update it directly:
```python Python
# Get existing task
task = chunkr.get_task("task_123")
# Update configuration
new_config = Configuration(
# your configuration options here
)
# Update and wait
task = task.update(new_config)
```
## Immediate vs. Waited Updates
Like task creation, you have two options for updates:
### Wait for Processing
The standard `update()` method waits for processing to complete:
```python Python
# Updates and waits for completion
task = chunkr.update("task_123", new_config)
print(task.status) # Will be "Succeeded"
```
### Immediate Response
Use `update_task()` for an immediate response without waiting:
```python Python
# Updates and returns immediately
task = chunkr.update_task("task_123", new_config)
print(task.status) # Might be "Starting"
# Get status later
result = task.poll()
```
## Async Usage
For async applications, use `await` with the update methods:
```python Python
# Update and wait
task = await chunkr.update("task_123", new_config)
# Update without waiting
task = await chunkr.update_task("task_123", new_config)
result = await task.poll() # Check status later
```
# Export to different formats
Source: https://docs.chunkr.ai/sdk/export
Chunkr AI allows you to export task results in multiple formats. You can get the content directly and save it to a file.
## Available Export Formats
### HTML Export
This will collate the `html` from every `segment` in all `chunks`, and create a single HTML file.
```python Python
# Get HTML content as string
html = task.html()
# Or save directly to file
html = task.html(output_file="output/result.html")
```
### Markdown Export
This will collate the `markdown` from every `segment` in all `chunks`, and create a single markdown file.
```python Python
# Get markdown content as string
md = task.markdown()
# Or save directly to file
md = task.markdown(output_file="output/result.md")
```
### Text Export
This will collate the `content` from every `segment` in all `chunks`, and create a single text file.
```python Python
# Get plain text content as string
text = task.content()
# Or save directly to file
text = task.content(output_file="output/result.txt")
```
### JSON Export
This will return the complete task data.
```python Python
# Get complete task data as dictionary
json = task.json()
# Or save directly to file
json = task.json(output_file="output/result.json")
```
## File Output
When using the `output_file` parameter:
* Directories in the path will be created automatically if they don't exist
* Files are written with UTF-8 encoding
* Existing files will be overwritten
Example with custom path:
```python Python
# Create nested directory structure and save file
task.html(output_file="exports/2024/january/result.html")
```
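If you routinely export several formats for the same task, it can help to build the output paths in one place first (the `export_paths` helper below is our own, not part of the SDK):

```python Python
from pathlib import Path

def export_paths(base_dir: str, stem: str) -> dict:
    # Build one output path per export format under base_dir.
    base = Path(base_dir)
    return {
        "html": str(base / f"{stem}.html"),
        "markdown": str(base / f"{stem}.md"),
        "text": str(base / f"{stem}.txt"),
        "json": str(base / f"{stem}.json"),
    }

paths = export_paths("exports", "result")
print(paths["markdown"])
```

You can then call `task.html(output_file=paths["html"])`, `task.markdown(output_file=paths["markdown"])`, and so on; the directories are created for you as described above.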
# Installation
Source: https://docs.chunkr.ai/sdk/installation
### Step 1: Sign Up and Create an API Key
1. Visit [Chunkr AI](https://chunkr.ai)
2. Click on "Login" and create your account
3. Once logged in, navigate to "API Keys" in the dashboard
For self-hosted deployments:
* [Docker Compose Setup Guide](https://github.com/lumina-ai-inc/chunkr?tab=readme-ov-file#quick-start-with-docker-compose)
* [Kubernetes Setup Guide](https://github.com/lumina-ai-inc/chunkr/tree/main/kube)
### Step 2: Install our client SDK
```bash Shell
pip install chunkr-ai
```
### Step 3: Upload your document
```python Python
from chunkr_ai import Chunkr
# Initialize the Chunkr client with your API key - get this from https://chunkr.ai
chunkr = Chunkr(api_key="your_api_key")
# Upload a document via url or local file path
url = "https://chunkr-web.s3.us-east-1.amazonaws.com/landing_page/input/science.pdf"
task = chunkr.upload(url)
```
### Step 4: Clean up
You can clean up the open connections by calling the `close()` method on the `Chunkr` client.
```python Python
chunkr.close()
```
## Environment Setup
You can authenticate with the Chunkr AI API in two ways:
1. **Direct API Key** - Pass your API key directly when initializing the client
2. **Environment Variable** - Set `CHUNKR_API_KEY` in your `.env` file
You can also configure the API endpoint:
1. **Direct URL** - Pass your API URL when initializing the client
2. **Environment Variable** - Set `CHUNKR_URL` in your `.env` file
This is particularly useful if you're running a self-hosted version of Chunkr.
```python Python
from chunkr_ai import Chunkr
# Option 1: Initialize with API key directly
chunkr = Chunkr(api_key="your_api_key")
# Option 2: Initialize without api_key parameter - will use CHUNKR_API_KEY from environment
chunkr = Chunkr()
# Option 3: Configure custom API endpoint
chunkr = Chunkr(api_key="your_api_key", chunkr_url="http://localhost:8000")
# Option 4: Use environment variables for both API key and URL
chunkr = Chunkr() # will use CHUNKR_API_KEY and CHUNKR_URL from environment
```
# Polling the TaskResponse
Source: https://docs.chunkr.ai/sdk/polling
The Chunkr AI API follows a task-based pattern where you create a task and monitor its progress through polling.
The `poll()` method handles this by automatically checking the task's status at regular intervals until it transitions out of the `Starting` or `Processing` states.
After polling completes, it's important to verify the final task status, which will be one of:
* `Succeeded`: Task completed successfully
* `Failed`: Task encountered an error
* `Cancelled`: Task was manually cancelled
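Once `poll()` returns, you typically branch on that final status. A minimal sketch (the `handle_final_status` helper and its return strings are our own):

```python Python
def handle_final_status(status: str) -> str:
    # Map a terminal task status to the next action for your pipeline.
    if status == "Succeeded":
        return "read task.output"
    if status == "Failed":
        return "inspect task.message and retry or report"
    if status == "Cancelled":
        return "task was cancelled before processing"
    raise ValueError(f"unexpected terminal status: {status}")

print(handle_final_status("Succeeded"))  # read task.output
```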
## Synchronous Usage
When you have a `TaskResponse` object, you can poll it.
Look at [creating](/sdk/data-operations/create) and [getting](/sdk/data-operations/get) a task for more information on how to get a `TaskResponse` object.
```python Python
from chunkr_ai import Chunkr
# Initialize client
chunkr = Chunkr()
try:
# Given that you already have a task object, you can poll it
task.poll()
print(task.output.chunks)
finally:
# Clean up when done
chunkr.close()
```
## Asynchronous Usage
For async applications, use `await`:
```python Python
from chunkr_ai import Chunkr
import asyncio
async def process_document():
# Initialize client
chunkr = Chunkr()
try:
# Given that you already have a task object
await task.poll()
print(task.output.chunks)
finally:
# Clean up when done
chunkr.close()
```
## Error Handling
By default, failed tasks (i.e. `task.status == "Failed"`) will not raise exceptions. You can configure this behavior using the `raise_on_failure` parameter when initializing the client:
```python Python
from chunkr_ai import Chunkr
# Initialize client with automatic error raising
chunkr = Chunkr(raise_on_failure=True)
```
# Using Chunkr AI SDK
Source: https://docs.chunkr.ai/sdk/usage
Chunkr AI's SDK supports both synchronous and asynchronous usage patterns.
The same client class `Chunkr` can be used for both patterns, making it flexible for different application needs.
All methods exist in both synchronous and asynchronous versions.
## Synchronous Usage
For simple scripts or applications that don't require asynchronous operations, you can use the synchronous pattern:
```python Python
from chunkr_ai import Chunkr
# Initialize client
chunkr = Chunkr()
try:
# Upload a file and wait for processing
task = chunkr.upload("document.pdf")
print(task.task_id)
# Alternatively, create task without waiting - you will get back a task object without chunks
task = chunkr.create_task("document.pdf")
# Poll the task when ready - this will wait for the task to complete and return a task object with chunks
task.poll()
print(task.output.chunks)
finally:
# Clean up when done
chunkr.close()
```
## Asynchronous Usage
For applications that benefit from asynchronous operations (like web servers or background tasks), you can use the async pattern:
```python Python
from chunkr_ai import Chunkr
import asyncio
async def process_document():
# Initialize client
chunkr = Chunkr()
try:
# Upload a file and wait for processing
task = await chunkr.upload("document.pdf")
print(task.task_id)
# Alternatively, create task without waiting - you will get back a task object without chunks
task = await chunkr.create_task("document.pdf")
# Poll the task when ready - this will wait for the task to complete and return a task object with chunks
await task.poll()
print(task.output.chunks)
finally:
# Clean up when done
chunkr.close()
```