# Health Check

Source: https://docs.chunkr.ai/api-references/health/health-check
OpenAPI: https://api.chunkr.ai/docs/openapi.json
Endpoint: get /health

Confirmation that the service can respond to requests.

# Cancel Task

Source: https://docs.chunkr.ai/api-references/task/cancel-task
OpenAPI: https://api.chunkr.ai/docs/openapi.json
Endpoint: get /api/v1/task/{task_id}/cancel

Cancel a task that hasn't started processing yet:

* For new tasks: Status will be updated to `Cancelled`
* For updating tasks: Task will revert to the previous state

Requirements:

* Task must have status `Starting`

# Create Task

Source: https://docs.chunkr.ai/api-references/task/create-task
OpenAPI: https://api.chunkr.ai/docs/openapi.json
Endpoint: post /api/v1/task/parse

Queues a document for processing and returns a TaskResponse containing:

* Task ID for status polling
* Initial configuration
* File metadata
* Processing status
* Creation timestamp
* Presigned URLs for file access

The returned task will typically be in a `Starting` or `Processing` state. Use the `GET /task/{task_id}` endpoint to poll for completion.

# Delete Task

Source: https://docs.chunkr.ai/api-references/task/delete-task
OpenAPI: https://api.chunkr.ai/docs/openapi.json
Endpoint: delete /api/v1/task/{task_id}

Delete a task by its ID.

Requirements:

* Task must have status `Succeeded` or `Failed`

# Get Task

Source: https://docs.chunkr.ai/api-references/task/get-task
OpenAPI: https://api.chunkr.ai/docs/openapi.json
Endpoint: get /api/v1/task/{task_id}

Retrieves detailed information about a task by its ID, including:

* Processing status
* Task configuration
* Output data (if processing is complete)
* File metadata (name, page count)
* Timestamps (created, started, finished)
* Presigned URLs for accessing files

This endpoint can be used to:

1. Poll the task status during processing
2. Retrieve the final output once processing is complete
3. Access task metadata and configuration

# Update Task

Source: https://docs.chunkr.ai/api-references/task/update-task
OpenAPI: https://api.chunkr.ai/docs/openapi.json
Endpoint: patch /api/v1/task/{task_id}/parse

Updates an existing task's configuration and reprocesses the document. The original configuration is used for any values not provided in the update.

Requirements:

* Task must have status `Succeeded` or `Failed`
* New configuration must be different from the current one

The returned task will typically be in a `Starting` or `Processing` state. Use the `GET /task/{task_id}` endpoint to poll for completion.

# Get Tasks

Source: https://docs.chunkr.ai/api-references/tasks/get-tasks
OpenAPI: https://api.chunkr.ai/docs/openapi.json
Endpoint: get /api/v1/tasks

Retrieves a list of tasks.

Example usage: `GET /api/v1/tasks?page=1&limit=10&include_chunks=false`

# Chunking

Source: https://docs.chunkr.ai/docs/features/chunking

Chunking is the process of splitting a document into smaller segments. These chunks can be used for semantic search and for better LLM performance. By leveraging layout analysis, we create intelligent chunks that preserve document structure and context. Our algorithm:

* Respects natural document boundaries (paragraphs, sections)
* Maintains semantic relationships between segments
* Optimizes chunk size for LLM processing

You can review the implementation of our chunking algorithm in our [GitHub repository](https://github.com/lumina-ai-inc/chunkr/blob/main/core/src/utils/services/chunking.rs#L113).
Here is an example that chunks the document into 512 words per chunk. These values are also the defaults, so you don't need to specify them.

```python Python
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    ChunkProcessing,
    Configuration,
    Tokenizer,
)

chunkr = Chunkr()
chunkr.upload("path/to/file", Configuration(
    chunk_processing=ChunkProcessing(
        ignore_headers_and_footers=True,
        target_length=512,
        tokenizer=Tokenizer.WORD
    ),
))
```

```bash cURL
curl --request POST \
  --url https://api.chunkr.ai/api/v1/task/parse \
  --header 'Authorization: YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "file": "base64_encoded_file_content",
    "file_name": "document.pdf",
    "chunk_processing": {
      "ignore_headers_and_footers": true,
      "target_length": 512,
      "tokenizer": {
        "Enum": "Word"
      }
    }
  }'
```

### Defaults

* `ignore_headers_and_footers`: True
* `target_length`: 512
* `tokenizer`: `Word`

## Tokenizer

Chunkr supports a large number of tokenizers. You can use our predefined ones or specify any tokenizer from Hugging Face.

### Predefined Tokenizers

The predefined tokenizers are enum values and can be used as follows:

```python Python
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    ChunkProcessing,
    Configuration,
    Tokenizer,
)

chunkr = Chunkr()
chunkr.upload("path/to/file", Configuration(
    chunk_processing=ChunkProcessing(
        tokenizer=Tokenizer.CL100K_BASE
    ),
))
```

```bash cURL
curl --request POST \
  --url https://api.chunkr.ai/api/v1/task/parse \
  --header 'Authorization: YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "file": "base64_encoded_file_content",
    "file_name": "document.pdf",
    "chunk_processing": {
      "tokenizer": {
        "Enum": "Cl100kBase"
      }
    }
  }'
```

Available options:

* `Word`: Split by words
* `Cl100kBase`: For OpenAI models (e.g. GPT-3.5, GPT-4, text-embedding-ada-002)
* `XlmRobertaBase`: For RoBERTa-based multilingual models
* `BertBaseUncased`: BERT base uncased tokenizer

You can also define the tokenizer enum as a string in the Python SDK. Here is an example where the string will be converted to the enum value:

```python Python
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    ChunkProcessing,
    Configuration,
)

chunkr = Chunkr()
chunkr.upload("path/to/file", Configuration(
    chunk_processing=ChunkProcessing(
        tokenizer="Word"
    ),
))
```

### Hugging Face Tokenizers

Use any Hugging Face tokenizer by providing its model ID as a string (e.g. "facebook/bart-large", "Qwen/Qwen-tokenizer", etc.)

```python Python
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    ChunkProcessing,
    Configuration,
)

chunkr = Chunkr()
chunkr.upload("path/to/file", Configuration(
    chunk_processing=ChunkProcessing(
        tokenizer="Qwen/Qwen-tokenizer"
    ),
))
```

```bash cURL
curl --request POST \
  --url https://api.chunkr.ai/api/v1/task/parse \
  --header 'Authorization: YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "file": "base64_encoded_file_content",
    "file_name": "document.pdf",
    "chunk_processing": {
      "tokenizer": {
        "String": "Qwen/Qwen-tokenizer"
      }
    }
  }'
```
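Since `target_length` is measured in units of the tokenizer you select (words for `Word`, tokens otherwise), it can help to sanity-check token counts locally before uploading. Here is a minimal, optional sketch using the `transformers` library — an assumption on our part, since that package is not part of the Chunkr SDK and is only used here for illustration:

```python Python
# Optional sanity check: count tokens locally with the same Hugging Face
# tokenizer ID you would pass to Chunkr. Requires `pip install transformers`.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")

text = "Chunkr splits documents into chunks based on token counts."
token_count = len(tokenizer.encode(text))
print(f"{token_count} tokens")  # compare this against your target_length
```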
## Calculating Chunk Lengths With Embed Sources

When calculating chunk lengths and performing tokenization, we use the text from the `embed` field in each chunk object. This field contains the text that will be compared against the target length.

You can configure what text goes into the `embed` field by setting the `embed_sources` parameter in your segment processing configuration. This parameter is specified under `segment_processing.{segment_type}` in your configuration. You can see more information about the `embed_sources` parameter in the [Segment Processing](/features/segment-processing) section.

Here's an example of customizing the `embed` field content for Picture segments. By configuring `embed_sources`, you can include both the LLM-generated output and Chunkr's markdown output in the `embed` field for Pictures, while other segment types continue using just the default markdown content. Additionally, we use the `CL100K_BASE` tokenizer so the counts match OpenAI models.

With this configuration, when calculating chunk lengths:

* Picture segments: Length will be based on both the LLM summary and the markdown content
* All other segments: Length will be based only on the markdown content
* The tokenizer will be `CL100K_BASE`

```python Python
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    ChunkProcessing,
    Configuration,
    EmbedSource,
    SegmentProcessing,
    SegmentProcessingPicture,
    Tokenizer,
)

chunkr = Chunkr()
chunkr.upload("path/to/file", Configuration(
    chunk_processing=ChunkProcessing(
        tokenizer=Tokenizer.CL100K_BASE
    ),
    segment_processing=SegmentProcessing(
        Picture=SegmentProcessingPicture(
            llm="Summarize the key information presented",
            embed_sources=[EmbedSource.MARKDOWN, EmbedSource.LLM]
        )
    )
))
```

```bash cURL
curl --request POST \
  --url https://api.chunkr.ai/api/v1/task/parse \
  --header 'Authorization: YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "file": "base64_encoded_file_content",
    "file_name": "document.pdf",
    "chunk_processing": {
      "tokenizer": {
        "Enum": "Cl100kBase"
      }
    },
    "segment_processing": {
      "Picture": {
        "llm": "Summarize the key information presented",
        "embed_sources": ["Markdown", "LLM"]
      }
    }
  }'
```

By combining the `embed_sources` parameter with the `tokenizer` parameter, you can customize chunk lengths and tokenization for different segment types. This allows for very powerful chunking configurations for your documents.

# Error Handling

Source: https://docs.chunkr.ai/docs/features/error-handling

Handle errors in the Chunkr AI API.

The Chunkr AI API provides a configurable approach to error handling during document processing. You can control how the system responds to errors that occur during various processing stages.

## Error Handling Strategy

The `ErrorHandlingStrategy` configuration allows you to specify how the system should respond when errors occur during document processing.

```python
from chunkr_ai import Chunkr
from chunkr_ai.models import Configuration, ErrorHandlingStrategy

# Create a config with the Continue error-handling strategy
config = Configuration(
    error_handling=ErrorHandlingStrategy.CONTINUE
)

# Upload a document with this configuration
chunkr = Chunkr()
response = await chunkr.upload("path/to/file", config)
```

### Available Strategies

| Strategy                         | Description                                                                                                                                    |
| -------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| `ErrorHandlingStrategy.FAIL`     | Default behavior. Processing stops immediately when any error occurs. The task will fail and return status `FAILED`.                            |
| `ErrorHandlingStrategy.CONTINUE` | Processing continues despite non-critical errors. The system will make reasonable attempts to recover and continue with the remaining content.   |
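Under the default `FAIL` strategy, any processing error surfaces as a failed task rather than a partial result. As a minimal sketch — assuming the SDK exposes a `Status` enum and `status`/`message` fields on the returned task, and noting that some SDK versions may raise an exception on failure instead — you can detect this after an upload:

```python Python
from chunkr_ai import Chunkr
from chunkr_ai.models import Status  # assumed enum; verify in your SDK version

chunkr = Chunkr()

# Default behavior is ErrorHandlingStrategy.FAIL: any error fails the task.
# Depending on your SDK version, a failed task may also raise an exception.
task = await chunkr.upload("path/to/file")

if task.status == Status.FAILED:
    # `message` is assumed to carry the failure reason
    print(f"Processing failed: {task.message}")
```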
## How Continue Mode Works

When you set `error_handling=ErrorHandlingStrategy.CONTINUE`, the system will attempt to gracefully handle various types of errors:

### LLM Processing Errors

If a segment encounters an LLM error:

* The system will skip that specific segment instead of failing the entire task
* Processing continues with the remaining segments

If a fallback model is configured, the system will first use it to process the segment; only if the fallback also fails does the error handling above apply. See [LLM Processing](/docs/features/llm-processing) for how to configure a fallback model.

### Layout Analysis Errors

When using `Continue` mode during layout analysis:

* If a page encounters layout detection problems, it defaults to segment type `Page`
* This ensures the content is still accessible even if optimal segmentation fails
* The page's content will be processed as a single segment

### OCR Strategy Fallbacks

In `Continue` mode with OCR processing:

* If OCR extraction encounters errors, it falls back to using the document's text layer
* This behaves similarly to `OcrStrategy.AUTO` mode
* Ensures text content is still available even when OCR processing fails

## Example Usage

```python Basic Example
from chunkr_ai import Chunkr
from chunkr_ai.models import Configuration, ErrorHandlingStrategy

chunkr = Chunkr()

# Use the Continue strategy for robust processing
config = Configuration(
    error_handling=ErrorHandlingStrategy.CONTINUE
)

response = await chunkr.upload("path/to/document.pdf", config)
# Processing will continue despite non-critical errors
```

```python Combined with Other Settings
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    Configuration,
    ErrorHandlingStrategy,
    LlmProcessing,
    FallbackStrategy
)

chunkr = Chunkr()

# Comprehensive configuration with error handling
config = Configuration(
    error_handling=ErrorHandlingStrategy.CONTINUE,
    llm_processing=LlmProcessing(
        model_id="gemini-pro-2.5",
        fallback_strategy=FallbackStrategy.model("claude-3.7-sonnet"),
        max_completion_tokens=4096
    )
)

response = await chunkr.upload("path/to/document.pdf", config)
# Will continue processing even if some LLM calls fail
```

```bash cURL
curl -X POST \
  --url https://api.chunkr.ai/api/v1/task/parse \
  --header "Authorization: YOUR_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "file": "base64_or_url_to_file",
    "error_handling": "CONTINUE",
    "llm_processing": {
      "fallback_strategy": {"model": "gemini-flash-2.0"},
      "model_id": "claude-3.7-sonnet"
    }
  }'
```

## When to Use Continue Mode

Consider using `ErrorHandlingStrategy.CONTINUE` when partial results are better than completely failed processing. For critical applications where accuracy is paramount, you may prefer the default `ErrorHandlingStrategy.FAIL` so you're alerted to any processing issues.

# Segmentation Strategy

Source: https://docs.chunkr.ai/docs/features/layout-analysis/segmentation_strategy

Controls the segmentation strategy.

The Chunkr AI API allows you to specify a `segmentation_strategy` for each document. This strategy controls how the document is segmented. We have two strategies:

* `LayoutAnalysis`: Run our state-of-the-art layout analysis model to identify the layout elements. This is the default strategy.
* `Page`: Each segment is a page.
This is how you can configure the segmentation strategy:

```python Python
from chunkr_ai import Chunkr
from chunkr_ai.models import Configuration, SegmentationStrategy

chunkr = Chunkr()
chunkr.upload("path/to/file", Configuration(
    segmentation_strategy=SegmentationStrategy.LAYOUT_ANALYSIS
))
```

```bash cURL
curl --request POST \
  --url https://api.chunkr.ai/api/v1/task/parse \
  --header 'Authorization: YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "file": "base64_encoded_file_content",
    "file_name": "document.pdf",
    "segmentation_strategy": "LayoutAnalysis"
  }'
```

## When to use each strategy

For most documents, we recommend the `LayoutAnalysis` strategy, as it gives the best results.

Use `Page` for:

* Faster processing when you need quick results and layout isn't critical
* Documents with unusual layouts that confuse the layout analysis model
* Layouts that are complex but not very information dense, where `Page` + VLM can generate surprisingly good HTML and markdown (see [Segment Processing](/docs/features/segment-processing))

# What is Layout Analysis?

Source: https://docs.chunkr.ai/docs/features/layout-analysis/what

Understand the importance of layout analysis in document processing.

Layout analysis is a crucial step in document processing that involves analyzing and understanding the spatial arrangement of content within a document. It helps identify and classify different regions of a document, such as `text`, `table`, `headers`, `footers`, and `pictures`. In short, it tells us where content is in the document and what it is.

## Why is Layout Analysis Important?

Layout analysis serves several key purposes:

* **Structure Recognition**: It helps identify the logical structure and reading order of a document
* **Data Extraction**: By identifying specific regions (like tables, headers, or paragraphs), we can use specialized extraction methods for each type, improving accuracy
* **Better Chunking**: Layout elements allow us to identify sections of the document and generate better chunks
* **Citations**: It allows LLMs to cite the correct region of the document, which can then be highlighted for a better experience

## Segment Types

Chunkr uses a two-way vision-grid transformer to identify the layout of the document. We support the following segment types:

* **Caption**: Text describing figures, tables, or other visual elements
* **Footnote**: References or additional information at the bottom of pages
* **Formula**: Mathematical or scientific equations
* **List Item**: Individual items in bulleted or numbered lists
* **Page**: Entire page (`segmentation_strategy=Page`)
* **Page Footer**: Content that appears at the bottom of each page
* **Page Header**: Content that appears at the top of each page
* **Picture**: Images, diagrams, or other visual elements
* **Section Header**: Headers that divide the document into sections
* **Table**: Structured data arranged in rows and columns
* **Text**: Regular paragraph text
* **Title**: Main document title
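Once a task completes, each segment in the output carries one of these types, so you can post-process specific regions. Below is a minimal sketch of collecting all Table segments; it assumes the output shape `task.output.chunks[].segments[]` and a `SegmentType` enum in the SDK models — verify both against your SDK version:

```python Python
from chunkr_ai import Chunkr
from chunkr_ai.models import SegmentType  # assumed enum; verify in your SDK

chunkr = Chunkr()
task = chunkr.upload("path/to/file")

# Walk chunks -> segments and keep only the tables
tables = [
    segment
    for chunk in task.output.chunks
    for segment in chunk.segments
    if segment.segment_type == SegmentType.TABLE
]
print(f"Found {len(tables)} table segments")
```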
# LLM Processing

Source: https://docs.chunkr.ai/docs/features/llm-processing

Process documents with LLMs.

The Chunkr AI API allows you to configure the LLMs that will be used to process documents. The LLM configuration is applied to your segments during the `segment_processing` step; click [here](./segment-processing) to learn more.

This is how you can configure the LLMs:

```python
llm_processing=LlmProcessing(
    model_id="gemini-pro-2.5",
    fallback_strategy=FallbackStrategy.model("gemini-flash-2.0"),
    max_completion_tokens=4096,
    temperature=0.0
)
```

## LLM Processing Options

The `LlmProcessing` configuration controls which language models are used for processing segments and provides fallback strategies if the primary model fails.

| Field                   | Type             | Description                                                                                          | Default                             |
| ----------------------- | ---------------- | ---------------------------------------------------------------------------------------------------- | ----------------------------------- |
| `model_id`              | String           | The ID of the model to use for processing. If not provided, the system default model will be used.  | [System default](#available-models) |
| `fallback_strategy`     | FallbackStrategy | Strategy to use if the primary model fails.                                                          | [System default](#available-models) |
| `max_completion_tokens` | Integer          | Maximum number of tokens to generate in the model response.                                          | None                                |
| `temperature`           | Float            | Controls randomness in model responses (0.0 = deterministic, higher = more random).                  | 0.0                                 |

## Fallback Strategies

When working with language models, reliability is important. Chunkr provides three fallback strategies to handle cases when your primary model fails:

* `FallbackStrategy.none()`: No fallback will be used. If the primary model fails, the operation returns an error.
* `FallbackStrategy.default()`: Use the system default fallback model.
* `FallbackStrategy.model("model-id")`: Specify a particular model ID to use as a fallback. This gives you explicit control over which alternative model is used.

## Example Usage

Here's how to configure LLM processing in different scenarios:

```python Simple Configuration
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    Configuration,
    LlmProcessing,
    FallbackStrategy
)

chunkr = Chunkr()

# Use Gemini Pro 2.5 with no fallback strategy
config = Configuration(
    llm_processing=LlmProcessing(
        model_id="gemini-pro-2.5",
        fallback_strategy=FallbackStrategy.none(),
        temperature=0.0
    )
)

chunkr.upload("path/to/file", config)
```

```python With Fallback Model
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    Configuration,
    LlmProcessing,
    FallbackStrategy
)

chunkr = Chunkr()

# Use Claude 3.7 Sonnet with Gemini Flash 2.0 as fallback
config = Configuration(
    llm_processing=LlmProcessing(
        model_id="claude-3.7-sonnet",
        fallback_strategy=FallbackStrategy.model("gemini-flash-2.0"),
        max_completion_tokens=4096,
        temperature=0.2
    )
)

chunkr.upload("path/to/file", config)
```

```bash cURL
curl -X POST \
  --url https://api.chunkr.ai/api/v1/task/parse \
  --header "Authorization: YOUR_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "file": "base64_or_url_to_file",
    "llm_processing": {
      "fallback_strategy": {"model": "gemini-flash-2.0"},
      "max_completion_tokens": 4096,
      "model_id": "claude-3.7-sonnet",
      "temperature": 0
    }
  }'
```

## Available Models

The following models are currently available for use with Chunkr: