Update Task Multipart

PATCH /api/v1/task/{task_id}

curl --request PATCH \
  --url https://api.chunkr.ai/api/v1/task/{task_id} \
  --header 'Authorization: <api-key>' \
  --header 'Content-Type: multipart/form-data' \
  --form chunk_processing=null \
  --form expires_in=123 \
  --form high_resolution=true \
  --form ocr_strategy=null \
  --form pipeline=null \
  --form segment_processing=null \
  --form segmentation_strategy=null

{
  "configuration": {
    "chunk_processing": {
      "ignore_headers_and_footers": true,
      "target_length": 512
    },
    "expires_in": 123,
    "high_resolution": true,
    "input_file_url": "<string>",
    "json_schema": "<any>",
    "model": null,
    "ocr_strategy": "All",
    "pipeline": null,
    "segment_processing": {
      "Caption": null,
      "Footnote": null,
      "Formula": null,
      "ListItem": null,
      "Page": null,
      "PageFooter": null,
      "PageHeader": null,
      "Picture": null,
      "SectionHeader": null,
      "Table": null,
      "Text": null,
      "Title": null
    },
    "segmentation_strategy": "LayoutAnalysis",
    "target_chunk_length": 123
  },
  "created_at": "2023-11-07T05:31:56Z",
  "expires_at": "2023-11-07T05:31:56Z",
  "finished_at": "2023-11-07T05:31:56Z",
  "message": "<string>",
  "output": null,
  "started_at": "2023-11-07T05:31:56Z",
  "status": "Starting",
  "task_id": "<string>",
  "task_url": "<string>"
}

Authorizations

Authorization

string

header

required

Path Parameters

task_id

string

required

Body

multipart/form-data

Multipart form request to update a task

chunk_processing

object | null

Controls the setting for the chunking and post-processing of each chunk.

expires_in

integer | null

The number of seconds until the task is deleted. Expired tasks cannot be updated, polled, or accessed via the web interface.

high_resolution

boolean | null

Whether to use high-resolution images for cropping and post-processing. (Latency penalty: ~7 seconds per page)

ocr_strategy

enum<string> | null

Controls the Optical Character Recognition (OCR) strategy.

All: Processes all pages with OCR. (Latency penalty: ~0.5 seconds per page)
Auto: Selectively applies OCR only to pages with missing or low-quality text. When a text layer is present, the bounding boxes from the text layer are used.

Available options: All, Auto

pipeline

enum<string> | null

The pipeline to use for processing. If pipeline is set to Azure then Azure layout analysis will be used for segmentation and OCR. The output will be unified to the Chunkr output.

Available options: Azure

segment_processing

object | null

Controls the post-processing of each segment type. Allows you to generate HTML and Markdown from Chunkr models for each segment type. By default, the HTML and Markdown are generated manually using the segmentation information, except for Table and Formula. You can optionally configure custom LLM prompts and models to generate an additional llm field with LLM-processed content for each segment type.

segment_processing.Caption

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

segment_processing.Footnote

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

segment_processing.Formula

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

segment_processing.ListItem

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

segment_processing.Page

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

segment_processing.PageFooter

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

segment_processing.PageHeader

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

segment_processing.Picture

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

segment_processing.SectionHeader

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

segment_processing.Table

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

segment_processing.Text

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

segment_processing.Title

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
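
As an illustration of the per-segment settings described above, the following Python sketch builds a segment_processing object. The accepted values (All or Auto for crop_image, Auto or LLM for html and markdown) come from the descriptions above. The helper names segment_config and build_segment_processing are hypothetical, and serializing the object as a JSON string for the form field is an assumption that this page does not confirm.

```python
import json

# Segment types listed in this reference.
SEGMENT_TYPES = {
    "Caption", "Footnote", "Formula", "ListItem", "Page", "PageFooter",
    "PageHeader", "Picture", "SectionHeader", "Table", "Text", "Title",
}
CROP_OPTIONS = {"All", "Auto"}
GENERATION_OPTIONS = {"Auto", "LLM"}


def segment_config(html="Auto", markdown="Auto", crop_image="Auto"):
    """Hypothetical helper: validate and build the settings for one segment type."""
    if html not in GENERATION_OPTIONS or markdown not in GENERATION_OPTIONS:
        raise ValueError("html and markdown must be 'Auto' or 'LLM'")
    if crop_image not in CROP_OPTIONS:
        raise ValueError("crop_image must be 'All' or 'Auto'")
    return {"html": html, "markdown": markdown, "crop_image": crop_image}


def build_segment_processing(**per_type):
    """Hypothetical helper: reject keys that are not known segment types."""
    unknown = set(per_type) - SEGMENT_TYPES
    if unknown:
        raise ValueError(f"unknown segment types: {sorted(unknown)}")
    return dict(per_type)


segment_processing = build_segment_processing(
    Table=segment_config(html="LLM", markdown="LLM"),
    Picture=segment_config(crop_image="All"),
)
# Assumption: object-valued form fields are sent as JSON strings.
form_value = json.dumps(segment_processing)
```

Segment types that are left out of the object are simply not sent; how the server treats omitted types is not specified here.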

segmentation_strategy

enum<string> | null

Controls the segmentation strategy:

LayoutAnalysis: Analyzes pages for layout elements (e.g., Table, Picture, Formula, etc.) using bounding boxes. Provides fine-grained segmentation and better chunking. (Latency penalty: ~TBD seconds per page).
Page: Treats each page as a single segment. Faster processing, but without layout element detection and only simple chunking.

Available options: LayoutAnalysis, Page
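
The body fields above can be assembled into a request without third-party libraries. The sketch below uses only the Python standard library and constructs the PATCH request without sending it. The function names are hypothetical. Two behaviors are assumptions rather than documented facts: fields set to None are left out of the form entirely, and object-valued fields are JSON-encoded into a text part.

```python
import json
import urllib.request
import uuid

API_URL = "https://api.chunkr.ai/api/v1/task/{task_id}"


def encode_multipart(fields):
    """Encode text fields as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        if value is None:
            continue  # assumption: omitted fields are not changed by the update
        if isinstance(value, bool):
            text = "true" if value else "false"
        elif isinstance(value, (dict, list)):
            text = json.dumps(value)  # assumption: objects travel as JSON strings
        else:
            text = str(value)
        lines += [
            f"--{boundary}",
            f'Content-Disposition: form-data; name="{name}"',
            "",
            text,
        ]
    lines += [f"--{boundary}--", ""]
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def build_update_request(task_id, api_key, **fields):
    """Build (but do not send) the PATCH request for a task update."""
    body, content_type = encode_multipart(fields)
    return urllib.request.Request(
        API_URL.format(task_id=task_id),
        data=body,
        method="PATCH",
        headers={"Authorization": api_key, "Content-Type": content_type},
    )


request = build_update_request(
    "my-task-id", "<api-key>", high_resolution=True, ocr_strategy="Auto"
)
```

Passing the request to urllib.request.urlopen would send it; the response body is the JSON task object described in the Response section.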

Response

200

application/json

Detailed information describing the task, its status and processed outputs

configuration

object

required

The configuration used for the task.

configuration.chunk_processing

object

required

Controls the setting for the chunking and post-processing of each chunk.

configuration.high_resolution

boolean

required

Whether to use high-resolution images for cropping and post-processing.

configuration.ocr_strategy

enum<string>

required

Controls the Optical Character Recognition (OCR) strategy.

All: Processes all pages with OCR. (Latency penalty: ~0.5 seconds per page)
Auto: Selectively applies OCR only to pages with missing or low-quality text. When a text layer is present, the bounding boxes from the text layer are used.

Available options: All, Auto

configuration.segment_processing

object

required

configuration.segment_processing.Caption

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

configuration.segment_processing.Footnote

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

configuration.segment_processing.Formula

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

configuration.segment_processing.ListItem

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

configuration.segment_processing.Page

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

configuration.segment_processing.PageFooter

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

configuration.segment_processing.PageHeader

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

configuration.segment_processing.Picture

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

configuration.segment_processing.SectionHeader

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

configuration.segment_processing.Table

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

configuration.segment_processing.Text

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

configuration.segment_processing.Title

object | null

Controls the processing and generation for the segment.

crop_image controls whether to crop the file's images to the segment's bounding box. The cropped image will be stored in the segment's image field. Use All to always crop, or Auto to only crop when needed for post-processing.
html is the HTML output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).
llm is the LLM-generated output for the segment; it uses off-the-shelf models to generate custom output for the segment.
markdown is the Markdown output for the segment, generated either through heuristics (Auto) or using Chunkr fine-tuned models (LLM).

configuration.segmentation_strategy

enum<string>

required

Controls the segmentation strategy:

LayoutAnalysis: Analyzes pages for layout elements (e.g., Table, Picture, Formula, etc.) using bounding boxes. Provides fine-grained segmentation and better chunking. (Latency penalty: ~TBD seconds per page).
Page: Treats each page as a single segment. Faster processing, but without layout element detection and only simple chunking.

Available options: LayoutAnalysis, Page

configuration.expires_in

integer | null

The number of seconds until the task is deleted. Expired tasks cannot be updated, polled, or accessed via the web interface.

configuration.input_file_url

string | null

The presigned URL of the input file.

configuration.json_schema

any

deprecated

configuration.model

enum<string> | null

deprecated

Available options: Fast, HighQuality

configuration.pipeline

enum<string> | null

Available options: Azure

configuration.target_chunk_length

integer | null

deprecated

The target number of words in each chunk. If 0, each chunk will contain a single segment.

created_at

string

required

The date and time when the task was created and queued.

message

string

required

A message describing the task's status or any errors that occurred.

status

enum<string>

required

The status of the task.

Available options: Starting, Processing, Succeeded, Failed, Cancelled
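
Because an update returns a task that may still be in progress, clients typically poll until the status stops changing. The sketch below shows one way to do that in Python. Treating Succeeded, Failed, and Cancelled as final states is an inference from their names, not something this page states, and fetch_task is a hypothetical callable that returns the task JSON as a dict.

```python
import time

# Inferred from the status names; not confirmed by this reference.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Cancelled"}
ALL_STATUSES = {"Starting", "Processing"} | TERMINAL_STATUSES


def is_terminal(status):
    """Return True when the task will (presumably) not change state again."""
    if status not in ALL_STATUSES:
        raise ValueError(f"unexpected status: {status!r}")
    return status in TERMINAL_STATUSES


def wait_for_task(fetch_task, max_polls=10, delay=1.0):
    """Call fetch_task() until the task reaches a terminal status.

    fetch_task is a hypothetical zero-argument callable returning the task dict.
    """
    for _ in range(max_polls):
        task = fetch_task()
        if is_terminal(task["status"]):
            return task
        time.sleep(delay)
    raise TimeoutError("task did not finish within the polling budget")
```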

task_id

string

required

The unique identifier for the task.

expires_at

string | null

The date and time when the task will expire.
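
Since expired tasks cannot be updated, a client may want to check expires_at before sending a PATCH. The following Python sketch parses the timestamp format shown in the example response (for instance 2023-11-07T05:31:56Z) and compares it with the current time. The function names are hypothetical, and treating a null expires_at as "never expires" is an assumption.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse a UTC timestamp such as '2023-11-07T05:31:56Z'; None stays None."""
    if value is None:
        return None
    # Replace the trailing 'Z' so older Python versions can parse the string.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_expired(task, now=None):
    """Return True if the task's expires_at lies at or before `now`."""
    expires_at = parse_timestamp(task.get("expires_at"))
    if expires_at is None:
        return False  # assumption: no expiry has been set
    now = now or datetime.now(timezone.utc)
    return now >= expires_at
```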

finished_at

string | null

The date and time when the task was finished.

output

object | null

The processed results of a document analysis task

output.chunks

object[]

required

Collection of document chunks, where each chunk contains one or more segments

output.chunks.chunk_length

integer

required

The total number of words in the chunk.

Required range: x > 0

output.chunks.segments

object[]

required

Collection of document segments that form this chunk. When target_chunk_length > 0, contains the maximum number of segments that fit within that length (segments remain intact). Otherwise, contains exactly one segment.
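
To make the packing rule above concrete, here is a greedy Python sketch that groups whole segments into chunks by word count. It is an illustration only and not necessarily how Chunkr builds chunks. The function name pack_segments is hypothetical, and counting words by splitting the content field on whitespace is an assumption.

```python
def pack_segments(segments, target_chunk_length):
    """Greedy sketch: group segments (dicts with 'content') into chunks.

    Segments are never split. A segment longer than the target still gets
    a chunk of its own.
    """
    if target_chunk_length <= 0:
        # One segment per chunk, as described for a target length of 0.
        return [[segment] for segment in segments]

    chunks, current, current_words = [], [], 0
    for segment in segments:
        words = len(segment["content"].split())
        # Close the current chunk if adding this segment would exceed the target.
        if current and current_words + words > target_chunk_length:
            chunks.append(current)
            current, current_words = [], 0
        current.append(segment)
        current_words += words
    if current:
        chunks.append(current)
    return chunks
```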

output.chunks.segments.bbox

object

required

Bounding box for an item. It is used for chunks, segments and OCR results.

output.chunks.segments.page_height

number

required

Height of the page containing the segment.

output.chunks.segments.page_number

integer

required

Page number of the segment.

Required range: x > 0

output.chunks.segments.page_width

number

required

Width of the page containing the segment.

output.chunks.segments.segment_id

string

required

Unique identifier for the segment.

output.chunks.segments.segment_type

enum<string>

required

All the possible types for a segment. Note: Different configurations will produce different types. Please refer to the documentation for more information.

Available options: Caption, Footnote, Formula, ListItem, Page, PageFooter, PageHeader, Picture, SectionHeader, Table, Text, Title

output.chunks.segments.confidence

number | null

Confidence score of the layout analysis model

output.chunks.segments.content

string

Text content of the segment.

output.chunks.segments.html

string

HTML representation of the segment.

output.chunks.segments.image

string | null

Presigned URL to the image of the segment.

output.chunks.segments.llm

string | null

LLM representation of the segment.

output.chunks.segments.markdown

string

Markdown representation of the segment.

output.chunks.segments.ocr

object[] | null

OCR results for the segment.

output.chunks.chunk_id

string

The unique identifier for the chunk.

output.chunks.embed

string | null

Suggested text to embed for search.

output.extracted_json

any

deprecated

The extracted JSON from the document.

output.file_name

string | null

The name of the file.

output.page_count

integer | null

The number of pages in the file.

Required range: x > 0

output.pdf_url

string | null

The presigned URL of the PDF file.

started_at

string | null

The date and time when the task was started.

task_url

string | null

The presigned URL of the task.
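
The response fields above nest as output, then chunks, then segments. As a rough sketch, the Python function below walks that structure and joins the Markdown of every segment. The function name task_markdown is hypothetical. It returns an empty string when output is null, which the example response shows is possible.

```python
def task_markdown(task):
    """Join the Markdown of every segment in a task response dict."""
    output = task.get("output")
    if not output:
        return ""  # output is null until results are available
    parts = []
    for chunk in output.get("chunks", []):
        for segment in chunk.get("segments", []):
            text = segment.get("markdown")
            if text:
                parts.append(text)
    return "\n\n".join(parts)
```

The same loop can collect html or content instead by changing the key that is read from each segment.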
