Chunkr provides structured extraction capabilities to extract specific data fields from documents according to a defined JSON schema. This allows you to convert unstructured document content into structured data formats.

JSON Schema

When creating a task, you can provide a JSON schema that defines the structure of data you want to extract. Here is an example of how to set up a structured extraction task:

curl -v -X POST "http://localhost:8000/api/v1/task" \
 -H "Authorization: <your_api_key>" \
 -F "file=@/Users/pyscripts/input/test.pdf" \
 -F "model=Fast" \
 -F "target_chunk_length=512" \
 -F "ocr_strategy=Auto" \
 -F 'json_schema={
"title": "Basket",
"type": "object",
"properties": [
{
"name": "patient cohort information",
"title": "Patient Cohort Information",
"type": "string",
"description": "A summary of patient cohort information",
"default": null
},
{
"name": "implications of data",
"title": "Implications of Data",
"type": "string",
"description": "The implications of the observed data",
"default": null
}
]
};type=application/json'

JSON Schema Structure

The json_schema defines the structure of the data to be extracted. It consists of a title, type, and a list of properties. Each property represents a specific field to extract from the document.

TypeScript Interface Representation

Below are the TypeScript interfaces that model the JSON schema:

export interface JsonSchema {
  title: string;
  type: string;
  properties: Property[];
}

export interface Property {
  name: string;
  title?: string;
  type: string;
  description?: string;
  default?: string;
}

Property Fields Explanation

  • name: The identifier for the field in the extracted data.
  • title: A human-readable title for the field.
  • type: The data type of the field (e.g., string, list).
  • description: A description of what the field represents.
  • default: The default value for the field if no data is extracted.

Interpreting the Response

Once the task is completed, the response will include the extracted data in a structured format. Here is an example of the output:

{
    "configuration": {
        "model": "Fast",
        "ocr_strategy": "Auto",
        "target_chunk_length": 512
    },
    "created_at": "2023-04-15T10:30:00Z",
    "expires_at": "2023-04-22T10:30:00Z",
    "file_name": "document.pdf",
    "finished_at": "2023-04-15T10:31:30Z",
    "input_file_url": "https://storage.chunkr.ai/input/document.pdf",
    "message": "Task completed successfully",
    "output": {
        "chunk_length": 1,
        "segments": [
            {
                "bbox": {},
                "content": "",
                "html": "",
                "image": "",
                "markdown": "",
                "ocr": [],
                "page_height": 842.0,
                "page_number": 1,
                "page_width": 595.0,
                "segment_id": "<segment_id>",
                "segment_type": "Text"
            }
        ]
    },
    "page_count": 5,
    "status": "Succeeded",
    "task_id": "<task_id>",
    "task_url": "https://api.chunkr.ai/api/v1/task/<task_id>"
},
"extracted_json": {
    "title": "Clinical Trial Results",
    "schema_type": "object",
    "extracted_fields": [
        {
            "name": "patient cohort information",
            "field_type": "string",
            "value": "The patient cohort consisted of 30-50 year old males with a history of hypertension and diabetes."
        },
        {
            "name": "implications of data",
            "field_type": "string",
            "value": "The effect of citrus bergamot on heart rate was observed to be statistically significant in the clinical trial performed."
        }
    ]
}