The Chunkr API creates a task for each document you want to process and then processes it asynchronously.

Task Object

Both creating and getting tasks return a task object. It contains the information about the file you are processing as well as the output when the task is successful.

interface TaskResponse {
  configuration: {
    model: Model;
    ocr_strategy: OcrStrategy;
    target_chunk_length: number;
  };
  created_at: string;
  expires_at?: string;
  file_name: string;
  finished_at?: string;
  input_file_url: string;
  message: string;
  output?: OutputResponse;
  page_count: number;
  status: "Starting" | "Processing" | "Succeeded" | "Failed" | "Cancelled";
  task_id: string;
  task_url: string;
}

Output

The output response is a dictionary with two keys: chunks and extracted_json. The chunks key contains the structured segment data defined in the Chunks section, and the extracted_json key contains any structured data extracted from the document, as defined in the Structured Extraction section. Below is an example of the output response:

{
  "chunks": [
    {
      "segments": [
        {
          "segment_id": "81729b96-086b-49ce-8112-de8e109bfc3c",
          "bbox": {
            "x1": 100,
            "y1": 150,
            "x2": 300,
            "y2": 350
          },
          "content": "Introduction to Black Holes",
          "html": "<p>Introduction to Black Holes</p>",
          "image": "https://storage.chunkr.ai/images/segment1.png",
          "markdown": "## Introduction to Black Holes",
          "ocr": [
            {
              "text": "Introduction to Black Holes",
              "confidence": 0.98
            }
          ],
          "page_height": 842.0,
          "page_number": 1,
          "page_width": 595.0,
          "segment_type": "Title"
        },
        {
          "segment_id": "a2712956-1234-4fce-9123-ld8e109bfc4d",
          "bbox": {
            "x1": 100,
            "y1": 360,
            "x2": 500,
            "y2": 550
          },
          "content": "Black holes are regions in space where the gravitational pull is so strong that nothing, not even light, can escape from them.",
          "html": "<p>Black holes are regions in space where the gravitational pull is so strong that nothing, not even light, can escape from them.</p>",
          "image": "",
          "markdown": "Black holes are regions in space where the gravitational pull is so strong that nothing, not even light, can escape from them.",
          "ocr": [
            {
              "text": "Black holes are regions in space where the gravitational pull is so strong that nothing, not even light, can escape from them.",
              "confidence": 0.95
            }
          ],
          "page_height": 842.0,
          "page_number": 1,
          "page_width": 595.0,
          "segment_type": "Text"
        }
      ],
      "chunk_length": 512
    }
  ],
  "extracted_json": {
    "title": "Clinical Trial Results",
    "schema_type": "object",
    "extracted_fields": [
      {
        "name": "black holes observed",
        "field_type": "list",
        "value": ["Sagittarius A*", "M87*"]
      },
      {
        "name": "implications of data",
        "field_type": "string",
        "value": "The observations confirm the predictions of general relativity."
      }
    ]
  }
}
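
As a rough illustration of how the chunks array can be consumed, the sketch below joins the markdown of every segment in each chunk. The `Segment`, `Chunk`, and `OutputResponse` shapes are trimmed-down assumptions inferred from the example above, and `chunksToMarkdown` is a hypothetical helper, not part of the API.

```typescript
// Minimal shapes inferred from the example response above.
// Assumption: the real API returns more fields than are listed here.
interface Segment {
  segment_id: string;
  segment_type: string;
  page_number: number;
  markdown: string;
}

interface Chunk {
  segments: Segment[];
  chunk_length: number;
}

interface OutputResponse {
  chunks: Chunk[];
  extracted_json?: unknown;
}

// Hypothetical helper: one markdown string per chunk,
// with segments separated by a blank line.
function chunksToMarkdown(output: OutputResponse): string[] {
  return output.chunks.map((chunk) =>
    chunk.segments.map((segment) => segment.markdown).join("\n\n")
  );
}

const sample: OutputResponse = {
  chunks: [
    {
      segments: [
        {
          segment_id: "s1",
          segment_type: "Title",
          page_number: 1,
          markdown: "## Introduction to Black Holes",
        },
        {
          segment_id: "s2",
          segment_type: "Text",
          page_number: 1,
          markdown: "Black holes are regions in space.",
        },
      ],
      chunk_length: 512,
    },
  ],
};

console.log(chunksToMarkdown(sample)[0]);
```

Keeping one string per chunk preserves the chunk boundaries, which is usually what you want when passing the text on to an embedding or retrieval step.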

Breakdown of the Extracted JSON

  • extracted_json: (Optional) Additional JSON data extracted based on the provided JSON schema during task creation.
    • title: The title of the extracted data.
    • schema_type: The type of JSON schema used.
    • extracted_fields: An array of fields extracted, each containing:
      • name: The name of the field.
      • field_type: The data type of the field (e.g., list, string).
      • value: The extracted value corresponding to the field.

For a comprehensive understanding of each component, please refer to the Structured Extraction and Chunks sections.
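
The breakdown above can be turned into a plain key-value lookup with a few lines of code. This is only a sketch: the `ExtractedField` and `ExtractedJson` types mirror the example response, and `fieldsToRecord` is a hypothetical helper that assumes field names are unique.

```typescript
// Shapes taken from the extracted_json example above.
interface ExtractedField {
  name: string;
  field_type: string;
  value: unknown;
}

interface ExtractedJson {
  title: string;
  schema_type: string;
  extracted_fields: ExtractedField[];
}

// Hypothetical helper: index extracted values by field name.
// Assumption: names are unique; a later duplicate overwrites an earlier one.
function fieldsToRecord(extracted: ExtractedJson): Record<string, unknown> {
  const record: Record<string, unknown> = {};
  for (const field of extracted.extracted_fields) {
    record[field.name] = field.value;
  }
  return record;
}

const extracted: ExtractedJson = {
  title: "Clinical Trial Results",
  schema_type: "object",
  extracted_fields: [
    {
      name: "black holes observed",
      field_type: "list",
      value: ["Sagittarius A*", "M87*"],
    },
    {
      name: "implications of data",
      field_type: "string",
      value: "The observations confirm the predictions of general relativity.",
    },
  ],
};

const record = fieldsToRecord(extracted);
console.log(record["implications of data"]);
```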


## Configuration

`configuration` holds the settings you used to create the task. You can find more information about how to configure a task in the [configuration](/model) section.

## File Name

The file name is the name of the file you are processing.

## Input File Url

The input file URL is a presigned URL to the file you are processing.

## PDF Url

In addition to PDFs, we accept DOC(X), PPT(X), and XLS(X) files. For each of these file types we generate a PDF that you can annotate. It is available in the `pdf_url` field of the `TaskResponse`.

## Message

The message is a human-readable string intended for display to the user. It provides more information about the task status and any errors.

## Output

The output is the processed data of the file you are processing. You can find more information about the output in the [output](/chunks) section.

When the status is `Succeeded`, the output contains a `chunks` array. See the [chunks](/chunks) section to learn more about the chunk object.
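
Because `output` is optional on the task object, it helps to check the status before reading it. The following sketch shows one way to do that; `TaskLike` is a trimmed-down assumption of the task object and `getChunks` is a hypothetical helper.

```typescript
type TaskStatus = "Starting" | "Processing" | "Succeeded" | "Failed" | "Cancelled";

// Assumption: only the fields needed for this check are modelled here.
interface TaskLike {
  status: TaskStatus;
  message: string;
  output?: { chunks: unknown[] };
}

// Hypothetical helper: return the chunks of a succeeded task,
// or throw an error carrying the task's status and message.
function getChunks(task: TaskLike): unknown[] {
  if (task.status !== "Succeeded" || task.output === undefined) {
    throw new Error(`Task is ${task.status}: ${task.message}`);
  }
  return task.output.chunks;
}
```

Throwing with the task's `message` keeps the reason for a failure visible to the caller instead of returning an empty array.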

## Page Count

The page count is the number of pages in the file you are processing.

## Status

The task status can be one of the following:

- `Starting`: The task is queued and waiting to be processed
- `Processing`: The task is currently being processed
- `Succeeded`: The task completed successfully
- `Failed`: The task failed
- `Cancelled`: The task was cancelled (coming soon)
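
When handling these statuses in code, it is often enough to know whether a task can still change state. The sketch below assumes that `Succeeded`, `Failed`, and `Cancelled` are final and that `Starting` and `Processing` are not. The list above does not state this explicitly, so treat `isTerminal` as a hypothetical helper built on that assumption.

```typescript
type TaskStatus = "Starting" | "Processing" | "Succeeded" | "Failed" | "Cancelled";

// Assumption: these three statuses are final and will not change again.
const TERMINAL_STATUSES: ReadonlySet<TaskStatus> = new Set<TaskStatus>([
  "Succeeded",
  "Failed",
  "Cancelled",
]);

// Hypothetical helper: true once a task has stopped being processed.
function isTerminal(status: TaskStatus): boolean {
  return TERMINAL_STATUSES.has(status);
}

console.log(isTerminal("Processing"), isTerminal("Failed"));
```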

## Task Url

The task URL is the URL you poll to retrieve the task object.
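
A polling loop over the task URL might look like the sketch below. The HTTP request and authentication are not described in this section, so the loop takes a caller-supplied `fetchTask` function instead of calling an endpoint directly. `pollTask`, `FetchTask`, `PolledTask`, the interval, and the attempt limit are all illustrative assumptions.

```typescript
type TaskStatus = "Starting" | "Processing" | "Succeeded" | "Failed" | "Cancelled";

// Assumption: only the fields needed for polling are modelled here.
interface PolledTask {
  status: TaskStatus;
  task_url: string;
  message: string;
}

// Caller-supplied function that retrieves the task object from its URL,
// for example by wrapping an authenticated HTTP GET request.
type FetchTask = (taskUrl: string) => Promise<PolledTask>;

// Hypothetical helper: poll until the task leaves Starting/Processing,
// waiting intervalMs between attempts and giving up after maxAttempts.
async function pollTask(
  taskUrl: string,
  fetchTask: FetchTask,
  intervalMs = 1000,
  maxAttempts = 60
): Promise<PolledTask> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const task = await fetchTask(taskUrl);
    if (task.status !== "Starting" && task.status !== "Processing") {
      return task;
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Task at ${taskUrl} did not finish after ${maxAttempts} attempts`);
}
```

Passing the fetch function in keeps the loop independent of any particular HTTP client and makes it easy to exercise with a stub.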