POST /api/v1/task/parse
curl --request POST \
  --url https://api.chunkr.ai/api/v1/task/parse \
  --header 'Authorization: <api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
  "chunk_processing": null,
  "expires_in": 123,
  "file": "<string>",
  "file_name": "<string>",
  "high_resolution": false,
  "ocr_strategy": null,
  "pipeline": null,
  "segment_processing": null,
  "segmentation_strategy": null
}'
{
  "configuration": {
    "chunk_processing": {
      "ignore_headers_and_footers": true,
      "target_length": 512
    },
    "expires_in": 123,
    "high_resolution": true,
    "input_file_url": "<string>",
    "json_schema": "<any>",
    "model": null,
    "ocr_strategy": "All",
    "pipeline": null,
    "segment_processing": {
      "Caption": null,
      "Footnote": null,
      "Formula": null,
      "ListItem": null,
      "Page": null,
      "PageFooter": null,
      "PageHeader": null,
      "Picture": null,
      "SectionHeader": null,
      "Table": null,
      "Text": null,
      "Title": null
    },
    "segmentation_strategy": "LayoutAnalysis",
    "target_chunk_length": 123
  },
  "created_at": "2023-11-07T05:31:56Z",
  "expires_at": "2023-11-07T05:31:56Z",
  "finished_at": "2023-11-07T05:31:56Z",
  "message": "<string>",
  "output": null,
  "started_at": "2023-11-07T05:31:56Z",
  "status": "Starting",
  "task_id": "<string>",
  "task_url": "<string>"
}

Authorizations

Authorization
string
header
required

Body

application/json
JSON request to create a task
file
string
required

The file to be uploaded. Can be a URL or a base64-encoded file.
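
As a minimal sketch of preparing the file value, the hypothetical helper below (file_field is not part of the API) passes URLs through unchanged and base64-encodes a local file. It assumes the API accepts the plain base64 string; whether a data-URI prefix is also accepted is not stated here.

```python
import base64
from pathlib import Path


def file_field(source: str) -> str:
    """Return a value for the `file` field (hypothetical helper).

    URLs pass through unchanged; anything else is treated as a local
    path and returned as a plain base64 string (an assumption about
    the encoding the API expects).
    """
    if source.startswith(("http://", "https://")):
        return source
    raw = Path(source).read_bytes()
    return base64.b64encode(raw).decode("ascii")
```

For example, file_field("https://example.com/report.pdf") returns the URL as-is, while file_field("report.pdf") returns the encoded file contents.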

chunk_processing
object | null

Controls the settings for chunking and for post-processing each chunk.

expires_in
integer | null

The number of seconds until the task is deleted. Expired tasks cannot be updated, polled, or accessed via the web interface.

file_name
string | null

The name of the file to be uploaded. If not set, a name will be generated.

high_resolution
boolean | null
default: false

Whether to use high-resolution images for cropping and post-processing. (Latency penalty: ~7 seconds per page)

ocr_strategy
enum<string> | null
default: All

Controls the Optical Character Recognition (OCR) strategy.

  • All: Processes all pages with OCR. (Latency penalty: ~0.5 seconds per page)
  • Auto: Selectively applies OCR only to pages with missing or low-quality text. When a text layer is present, the bounding boxes from the text layer are used.
Available options: All, Auto

pipeline
enum<string> | null

The PipelineType to use for processing. If pipeline is set to Azure, Azure layout analysis is used for segmentation and OCR, and the output is unified to the Chunkr output format.

Available options: Azure

segment_processing
object | null

Controls the post-processing of each segment type. Lets you generate HTML and Markdown from Chunkr models for each segment type. By default, the HTML and Markdown are generated manually from the segmentation information, except for Table and Formula segments. You can optionally configure custom LLM prompts and models to generate an additional llm field containing LLM-processed content for each segment type.

segmentation_strategy
enum<string> | null
default: LayoutAnalysis

Controls the segmentation strategy:

  • LayoutAnalysis: Analyzes pages for layout elements (e.g., Table, Picture, Formula) using bounding boxes. Provides fine-grained segmentation and better chunking. (Latency penalty: ~TBD seconds per page)
  • Page: Treats each page as a single segment. Faster processing, but without layout element detection and only simple chunking.
Available options: LayoutAnalysis, Page
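
The body fields above can be assembled and checked client-side before sending. The following is a sketch only: build_parse_body and build_parse_request are hypothetical helpers, and omitting unset optional fields (instead of sending null, as the curl sample does) assumes the server then applies its documented defaults. The request is built with the standard library but not sent.

```python
import json
import urllib.request

# Allowed values copied from the field descriptions above.
OCR_STRATEGIES = {"All", "Auto"}
SEGMENTATION_STRATEGIES = {"LayoutAnalysis", "Page"}
PIPELINES = {"Azure"}
PARSE_URL = "https://api.chunkr.ai/api/v1/task/parse"


def build_parse_body(file, *, file_name=None, expires_in=None,
                     high_resolution=None, ocr_strategy=None,
                     segmentation_strategy=None, pipeline=None):
    """Assemble the JSON body, rejecting values outside the listed options."""
    checks = (
        ("ocr_strategy", ocr_strategy, OCR_STRATEGIES),
        ("segmentation_strategy", segmentation_strategy, SEGMENTATION_STRATEGIES),
        ("pipeline", pipeline, PIPELINES),
    )
    for name, value, allowed in checks:
        if value is not None and value not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}, got {value!r}")
    optional = {
        "file_name": file_name,
        "expires_in": expires_in,
        "high_resolution": high_resolution,
        "ocr_strategy": ocr_strategy,
        "segmentation_strategy": segmentation_strategy,
        "pipeline": pipeline,
    }
    body = {"file": file}
    # Assumption: leaving a field out is equivalent to sending null.
    body.update({k: v for k, v in optional.items() if v is not None})
    return body


def build_parse_request(api_key, body):
    """Build (but do not send) the POST request shown in the curl sample."""
    return urllib.request.Request(
        PARSE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to urllib.request.urlopen would submit the task; validating the enum values locally simply surfaces typos such as "auto" before a network round trip.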

Response

200
application/json
Detailed information describing the task, its status and processed outputs
configuration
object
required

The configuration used for the task.

created_at
string
required

The date and time when the task was created and queued.

message
string
required

A message describing the task's status or any errors that occurred.

status
enum<string>
required

The status of the task.

Available options: Starting, Processing, Succeeded, Failed, Cancelled
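
A client that waits on a task typically needs to know when to stop checking. The sketch below models the five status values as an enum and treats Succeeded, Failed, and Cancelled as final; that grouping is an assumption based on the names, and TaskStatus and is_terminal are hypothetical helpers.

```python
from enum import Enum


class TaskStatus(str, Enum):
    """The status values listed above."""
    STARTING = "Starting"
    PROCESSING = "Processing"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    CANCELLED = "Cancelled"


# Assumption: these three statuses never change once reached.
TERMINAL_STATUSES = frozenset(
    {TaskStatus.SUCCEEDED, TaskStatus.FAILED, TaskStatus.CANCELLED}
)


def is_terminal(status: str) -> bool:
    """Return True if the task is done; raises ValueError for unknown statuses."""
    return TaskStatus(status) in TERMINAL_STATUSES
```

Raising on an unrecognized status, instead of silently treating it as still running, keeps a waiting loop from spinning forever if new values are introduced.
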
task_id
string
required

The unique identifier for the task.

expires_at
string | null

The date and time when the task will expire.

finished_at
string | null

The date and time when the task was finished.

output
object | null

The processed results of a document analysis task

started_at
string | null

The date and time when the task was started.

task_url
string | null

The presigned URL of the task.
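
The timestamp fields can be turned into timezone-aware values to measure how long a task ran. This is a sketch that assumes the timestamps follow the ISO 8601 form in the sample response (for example 2023-11-07T05:31:56Z); parse_timestamp and processing_time are hypothetical helpers, and nullable fields are passed through as None.

```python
from datetime import datetime, timedelta
from typing import Optional


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse a timestamp such as '2023-11-07T05:31:56Z'; None stays None."""
    if value is None:
        return None
    # Older Python versions do not accept a trailing 'Z' in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def processing_time(task: dict) -> Optional[timedelta]:
    """Return finished_at minus started_at, or None if either is missing."""
    started = parse_timestamp(task.get("started_at"))
    finished = parse_timestamp(task.get("finished_at"))
    if started is None or finished is None:
        return None
    return finished - started
```

A task that has not started or finished yet yields None, which mirrors the nullable started_at and finished_at fields.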