# Get all supported file types
Source: https://docs.chunkr.ai/api-references/extras/get-all-supported-file-types
https://api.chunkr.ai/docs/openapi.json get /file-types
Returns a list of all file types supported by Chunkr, grouped by category.
Each category contains a list of formats, where each format includes an extension
paired with its corresponding MIME type.
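As a rough illustration of consuming such a listing, the sketch below flattens a category-grouped response into an extension-to-MIME lookup. The response shape, the field names `extension` and `mime_type`, and the sample entries are assumptions made for illustration; check the OpenAPI schema for the real structure.

```python
from typing import Dict, List

# Hypothetical response shape: category -> list of {"extension", "mime_type"}.
# The real field names and categories may differ.
sample_response: Dict[str, List[Dict[str, str]]] = {
    "documents": [
        {"extension": "pdf", "mime_type": "application/pdf"},
        {"extension": "txt", "mime_type": "text/plain"},
    ],
    "images": [
        {"extension": "png", "mime_type": "image/png"},
    ],
}


def build_extension_index(file_types: Dict[str, List[Dict[str, str]]]) -> Dict[str, str]:
    """Flatten the grouped listing into an extension -> MIME type lookup."""
    index: Dict[str, str] = {}
    for formats in file_types.values():
        for fmt in formats:
            # Normalise ".PDF" and "pdf" to the same key.
            index[fmt["extension"].lower().lstrip(".")] = fmt["mime_type"]
    return index


def is_supported(filename: str, index: Dict[str, str]) -> bool:
    """Return True when the filename's extension appears in the index."""
    _, dot, ext = filename.rpartition(".")
    return bool(dot) and ext.lower() in index
```

Building the index once and checking filenames locally avoids submitting files that would be rejected for their type.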
# Delete File
Source: https://docs.chunkr.ai/api-references/files/delete-file
https://api.chunkr.ai/docs/openapi.json delete /files/{file_id}
Deletes the file contents and scrubs sensitive metadata.
Minimal metadata is retained for audit and usage reporting, per the ZDR policy.
# Download File Content
Source: https://docs.chunkr.ai/api-references/files/download-file-content
https://api.chunkr.ai/docs/openapi.json get /files/{file_id}/content
Streams the file bytes directly if the caller is authorized. The response sets the
`Content-Type` header to the file's detected MIME type.
# Get File
Source: https://docs.chunkr.ai/api-references/files/get-file
https://api.chunkr.ai/docs/openapi.json get /files/{file_id}
Returns metadata for a file owned by the authenticated user.
The response includes a permanent `ch://files/{file_id}` URL,
file name, content type, size, user-provided metadata, and timestamps.
If the file is not found or the user is not authorized, the response will be 401 Unauthorized.
# Get File URL
Source: https://docs.chunkr.ai/api-references/files/get-file-url
https://api.chunkr.ai/docs/openapi.json get /files/{file_id}/url
Returns a presigned download URL by default. If `base64_urls=true`, returns
base64-encoded file content. Control expiry with `expires_in` (seconds).
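The two query parameters can be combined as in the sketch below, which only builds the request URL. The base URL `https://api.chunkr.ai` is an assumption taken from the OpenAPI document's host, and the helper name `file_url_endpoint` is hypothetical; authentication is not shown.

```python
from typing import Optional
from urllib.parse import quote, urlencode

BASE_URL = "https://api.chunkr.ai"  # assumed from the OpenAPI document's host


def file_url_endpoint(
    file_id: str,
    base64_urls: bool = False,
    expires_in: Optional[int] = None,
) -> str:
    """Build the GET /files/{file_id}/url request URL."""
    params = {}
    if base64_urls:
        # Ask for base64-encoded content instead of a presigned URL.
        params["base64_urls"] = "true"
    if expires_in is not None:
        if expires_in <= 0:
            raise ValueError("expires_in must be a positive number of seconds")
        params["expires_in"] = str(expires_in)
    path = f"{BASE_URL}/files/{quote(file_id, safe='')}/url"
    return f"{path}?{urlencode(params)}" if params else path
```

For example, `file_url_endpoint("abc123", expires_in=600)` requests a presigned URL that expires after ten minutes.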
# List Files
Source: https://docs.chunkr.ai/api-references/files/list-files
https://api.chunkr.ai/docs/openapi.json get /files
Lists files for the authenticated user with cursor-based pagination and optional filtering by date range.
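Cursor-based pagination is typically consumed with a loop like the one below. This is a generic sketch, not the SDK's interface: the response keys `files` and `next_cursor` are assumptions, and `fetch_page` stands in for whatever function performs the actual `GET /files` request.

```python
from typing import Any, Callable, Dict, Iterator, Optional

Page = Dict[str, Any]


def iter_all_files(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[Dict[str, Any]]:
    """Yield every file by following cursors until none is returned."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("files", [])        # assumed key for the items
        cursor = page.get("next_cursor")        # assumed key for the cursor
        if not cursor:
            break


# In-memory stand-in for the HTTP call, used only to exercise the loop.
_FAKE_PAGES: Dict[Optional[str], Page] = {
    None: {"files": [{"file_id": "f1"}, {"file_id": "f2"}], "next_cursor": "c1"},
    "c1": {"files": [{"file_id": "f3"}], "next_cursor": None},
}


def fake_fetch(cursor: Optional[str]) -> Page:
    return _FAKE_PAGES[cursor]
```

Writing the loop as a generator lets callers stop early without fetching the remaining pages.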
# Upload a file
Source: https://docs.chunkr.ai/api-references/files/upload-a-file
https://api.chunkr.ai/docs/openapi.json post /files
Accepts multipart/form-data with fields:
- file: binary (required)
- file_metadata: string (optional, JSON string)
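Because `file_metadata` is a JSON string rather than a nested object, the metadata has to be serialised before it goes into the form. The sketch below prepares the two fields; the helper name `build_upload_form` is hypothetical, and the returned pair is shaped so it could be handed to an HTTP client that supports multipart uploads (for instance the `files=` and `data=` arguments of `requests.post`).

```python
import json
from typing import Any, Dict, Optional, Tuple

FormParts = Tuple[Dict[str, Tuple[str, bytes]], Dict[str, str]]


def build_upload_form(
    filename: str,
    content: bytes,
    metadata: Optional[Dict[str, Any]] = None,
) -> FormParts:
    """Return (files, data) for a multipart/form-data POST /files request."""
    # Required binary part, keyed by the field name "file".
    files = {"file": (filename, content)}
    data: Dict[str, str] = {}
    if metadata is not None:
        # Optional part: the API expects a JSON *string*, not a nested form.
        data["file_metadata"] = json.dumps(metadata)
    return files, data
```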
# Health Check
Source: https://docs.chunkr.ai/api-references/health/health-check
https://api.chunkr.ai/docs/openapi.json get /health
Confirms that the service can respond to requests.
# Task extract updated
Source: https://docs.chunkr.ai/api-references/task-extract-updated
https://api.chunkr.ai/docs/openapi.json webhook task.extract.updated
An extract task has been updated. The event is sent when the status or message of a task changes.
# Task parse updated
Source: https://docs.chunkr.ai/api-references/task-parse-updated
https://api.chunkr.ai/docs/openapi.json webhook task.parse.updated
A parse task has been updated. The event is sent when the status or message of a task changes.
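A receiver for these two events usually routes on the event name. The sketch below shows one way to do that; the envelope shape (`type` and `data` keys) and the `task_id` and `status` fields are assumptions for illustration, so consult the webhook schema in the OpenAPI document for the real payload.

```python
from typing import Any, Callable, Dict, Optional

Handler = Callable[[Dict[str, Any]], str]
HANDLERS: Dict[str, Handler] = {}


def on(event_type: str) -> Callable[[Handler], Handler]:
    """Register a handler for one webhook event name."""
    def register(func: Handler) -> Handler:
        HANDLERS[event_type] = func
        return func
    return register


@on("task.parse.updated")
def handle_parse_updated(task: Dict[str, Any]) -> str:
    return f"parse task {task.get('task_id')} is now {task.get('status')}"


@on("task.extract.updated")
def handle_extract_updated(task: Dict[str, Any]) -> str:
    return f"extract task {task.get('task_id')} is now {task.get('status')}"


def dispatch(event: Dict[str, Any]) -> Optional[str]:
    """Route an incoming event to its handler; ignore unknown event types."""
    handler = HANDLERS.get(event.get("type", ""))
    if handler is None:
        return None
    return handler(event.get("data", {}))
```

Ignoring unknown event names keeps the receiver tolerant of event types added later.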
# Cancel Task
Source: https://docs.chunkr.ai/api-references/tasks/cancel-task
https://api.chunkr.ai/docs/openapi.json get /tasks/{task_id}/cancel
Cancel a task that hasn't started processing yet:
- For new tasks: Status will be updated to `Cancelled`
- For updating tasks: Task will revert to the previous state
Requirements:
- Task must have status `Starting`
# Create Extract Task
Source: https://docs.chunkr.ai/api-references/tasks/create-extract-task
https://api.chunkr.ai/docs/openapi.json post /tasks/extract
Creates an extract task and returns its metadata immediately.
Queues a document or a previously parsed task for extraction and returns a `TaskResponse` with the
assigned `task_id`, initial configuration, file metadata, and timestamps.
The initial status is `Starting`.
# Create Parse Task
Source: https://docs.chunkr.ai/api-references/tasks/create-parse-task
https://api.chunkr.ai/docs/openapi.json post /tasks/parse
Creates a parse task and returns its metadata immediately.
Queues a document for processing and returns a `TaskResponse` with the
assigned `task_id`, initial configuration, file metadata, and timestamps.
The initial status is `Starting`.
# Delete Task
Source: https://docs.chunkr.ai/api-references/tasks/delete-task
https://api.chunkr.ai/docs/openapi.json delete /tasks/{task_id}
Delete a task by its ID.
Requirements:
- Task must have status `Succeeded` or `Failed`
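The status requirements for cancelling and deleting can be checked client-side before calling either endpoint. This is a minimal sketch based only on the requirements stated in this reference; the helper names are hypothetical, and the server remains the source of truth.

```python
from typing import List

# Statuses taken from the requirements listed for each endpoint.
CANCELLABLE = frozenset({"Starting"})
DELETABLE = frozenset({"Succeeded", "Failed"})


def can_cancel(status: str) -> bool:
    """Cancel is only documented for tasks that have not started processing."""
    return status in CANCELLABLE


def can_delete(status: str) -> bool:
    """Delete is only documented for tasks that finished, successfully or not."""
    return status in DELETABLE


def allowed_actions(status: str) -> List[str]:
    """List which of the two operations the documented rules permit."""
    actions = []
    if can_cancel(status):
        actions.append("cancel")
    if can_delete(status):
        actions.append("delete")
    return actions
```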
# Get Extract Task
Source: https://docs.chunkr.ai/api-references/tasks/get-extract-task
https://api.chunkr.ai/docs/openapi.json get /tasks/{task_id}/extract
Retrieves the current state of an extract task.
Returns task details such as processing status, configuration, output (when
available), file metadata, and timestamps.
Typical uses:
- Poll a task during processing
- Retrieve the final output once processing is complete
- Access task metadata and configuration
# Get Parse Task
Source: https://docs.chunkr.ai/api-references/tasks/get-parse-task
https://api.chunkr.ai/docs/openapi.json get /tasks/{task_id}/parse
Retrieves the current state of a parse task.
Returns task details such as processing status, configuration, output (when
available), file metadata, and timestamps.
Typical uses:
- Poll a task during processing
- Retrieve the final output once processing is complete
- Access task metadata and configuration
# Get Task
Source: https://docs.chunkr.ai/api-references/tasks/get-task
https://api.chunkr.ai/docs/openapi.json get /tasks/{task_id}
Retrieves the current state of a task.
Returns task details such as processing status, configuration, output (when
available), file metadata, and timestamps.
Typical uses:
- Poll a task during processing
- Retrieve the final output once processing is complete
- Access task metadata and configuration
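Polling a task during processing can be sketched as a bounded loop. The code below is an illustration under assumptions: `get_task` stands in for whatever performs the `GET /tasks/{task_id}` request, the response is assumed to expose a `status` field, and the set of terminal statuses is inferred from the status names used elsewhere in this reference.

```python
import time
from typing import Any, Callable, Dict

# Assumed terminal statuses, based on the names mentioned in this reference.
TERMINAL_STATUSES = frozenset({"Succeeded", "Failed", "Cancelled"})


def poll_task(
    get_task: Callable[[str], Dict[str, Any]],
    task_id: str,
    interval: float = 1.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Fetch the task repeatedly until it reaches a terminal status."""
    for attempt in range(max_attempts):
        task = get_task(task_id)
        if task.get("status") in TERMINAL_STATUSES:
            return task
        if attempt < max_attempts - 1:
            # Wait before the next request; skipped after the final attempt.
            sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the loop testable without real delays. Where webhooks are configured, the `task.parse.updated` and `task.extract.updated` events are an alternative to polling.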
# List Tasks
Source: https://docs.chunkr.ai/api-references/tasks/list-tasks
https://api.chunkr.ai/docs/openapi.json get /tasks
Lists tasks for the authenticated user with cursor-based pagination
and optional filtering by date range. Supports ascending or descending
sort order and optional inclusion of chunks/base64 URLs.
# Get webhook URL
Source: https://docs.chunkr.ai/api-references/webhook/get-webhook-url
https://api.chunkr.ai/docs/openapi.json get /webhook/url
Gets the webhook for the authenticated user, creating one if none exists, and returns the dashboard URL.
# Extract Examples
Source: https://docs.chunkr.ai/pages/features/extract/examples
Extract structured data from documents using defined schemas.
Practical examples showing how to extract structured data from different document types using Python (Pydantic) and TypeScript (Zod) schemas.
For a complete understanding of Extract outputs, see [Extract Outputs](/pages/features/extract/outputs).
For configuration options, see [Extract Overview](/pages/features/extract/overview).
## Financial Report
Extract key financial metrics and data from financial statements, earnings reports, and other financial documents.
### Schema
```python Python theme={"system"}
import os
from typing import List, Optional

from chunkr_ai import Chunkr
from pydantic import BaseModel, Field


class FinancialPositionItem(BaseModel):
    category: str
    subcategory: Optional[str]
    account_name: str
    amount: float


class FinancialPositionPeriod(BaseModel):
    as_of_date: str = Field(description="Date for this financial position snapshot")
    items: List[FinancialPositionItem]


class StatementOfFinancialPosition(BaseModel):
    currency: str
    periods: List[FinancialPositionPeriod]


class ProfitOrLossItem(BaseModel):
    account_name: str = Field(description="Name of the profit and loss account")
    amount: float


class StatementOfProfitAndLoss(BaseModel):
    period_start: str
    period_end: str
    currency: str
    items: List[ProfitOrLossItem] = Field(description="List of profit and loss items")
    net_income: float = Field(description="Total net income for this period")


class ComprehensiveIncomeItem(BaseModel):
    account_name: str = Field(description="Name of the comprehensive income account")
    amount: float = Field(description="Monetary value in the statement's base currency")


class ComprehensiveIncomePeriod(BaseModel):
    period_end: str = Field(description="Ending date for this period (YYYY-MM-DD)")
    items: List[ComprehensiveIncomeItem]
    total_comprehensive_income: float


class StatementOfComprehensiveIncome(BaseModel):
    currency: str
    periods: List[ComprehensiveIncomePeriod]


class EquityChangeItem(BaseModel):
    component: str = Field(description="Name of the equity component")
    opening_balance: float = Field(description="Balance at the beginning of the period")
    changes: List[str] = Field(description="Descriptions of changes during the period")
    closing_balance: float = Field(description="Balance at the end of the period")


class StatementOfChangesInEquity(BaseModel):
    period_end: str
    currency: str
    items: List[EquityChangeItem]


class CashFlowItem(BaseModel):
    activity_type: str = Field(description="Type of cash flow activity")
    account_name: str = Field(description="Name of the cash flow account")
    amount: float


class CashFlowPeriod(BaseModel):
    period_end: str
    items: List[CashFlowItem] = Field(description="List of cash flow items")
    net_increase_in_cash: float = Field(description="Net increase for this period")
    closing_cash_balance: float = Field(description="Balance at the end of this period")


class StatementOfCashFlows(BaseModel):
    currency: str
    periods: List[CashFlowPeriod] = Field(description="List of cash flow periods")


class FinancialReport(BaseModel):
    entity_name: str = Field(description="Name of the reporting entity")
    report_title: Optional[str]
    reporting_period_start: Optional[str]
    reporting_period_end: Optional[str]
    reporting_as_of_date: Optional[str]
    currency: str
    statement_of_financial_position: Optional[StatementOfFinancialPosition]
    statement_of_profit_or_loss: Optional[StatementOfProfitAndLoss]
    statement_of_comprehensive_income: Optional[StatementOfComprehensiveIncome]
    statement_of_changes_in_equity: Optional[StatementOfChangesInEquity]
    statement_of_cash_flows: Optional[StatementOfCashFlows]
```
```typescript TypeScript theme={"system"}
import * as z from "zod";
import Chunkr from "chunkr-ai";

const FinancialPositionItemSchema = z.object({
  category: z.string(),
  subcategory: z.string().optional(),
  account_name: z.string(),
  amount: z.number(),
});

const FinancialPositionPeriodSchema = z.object({
  as_of_date: z.string().describe("Date for this financial position snapshot"),
  items: z.array(FinancialPositionItemSchema),
});

const StatementOfFinancialPositionSchema = z.object({
  currency: z.string(),
  periods: z.array(FinancialPositionPeriodSchema),
});

const ProfitOrLossItemSchema = z.object({
  account_name: z.string().describe("Name of the profit and loss account"),
  amount: z.number(),
});

const StatementOfProfitAndLossSchema = z.object({
  period_start: z.string(),
  period_end: z.string(),
  currency: z.string(),
  items: z
    .array(ProfitOrLossItemSchema)
    .describe("List of profit and loss items"),
  net_income: z.number().describe("Total net income for this period"),
});

const ComprehensiveIncomeItemSchema = z.object({
  account_name: z.string().describe("Name of the comprehensive income account"),
  amount: z
    .number()
    .describe("Monetary value in the statement's base currency"),
});

const ComprehensiveIncomePeriodSchema = z.object({
  period_end: z.string().describe("Ending date for this period (YYYY-MM-DD)"),
  items: z.array(ComprehensiveIncomeItemSchema),
  total_comprehensive_income: z.number(),
});

const StatementOfComprehensiveIncomeSchema = z.object({
  currency: z.string(),
  periods: z.array(ComprehensiveIncomePeriodSchema),
});

const EquityChangeItemSchema = z.object({
  component: z.string().describe("Name of the equity component"),
  opening_balance: z
    .number()
    .describe("Balance at the beginning of the period"),
  changes: z
    .array(z.string())
    .describe("Descriptions of changes during the period"),
  closing_balance: z.number().describe("Balance at the end of the period"),
});

const StatementOfChangesInEquitySchema = z.object({
  period_end: z.string(),
  currency: z.string(),
  items: z.array(EquityChangeItemSchema),
});

const CashFlowItemSchema = z.object({
  activity_type: z.string().describe("Type of cash flow activity"),
  account_name: z.string().describe("Name of the cash flow account"),
  amount: z.number(),
});

const CashFlowPeriodSchema = z.object({
  period_end: z.string(),
  items: z.array(CashFlowItemSchema).describe("List of cash flow items"),
  net_increase_in_cash: z.number().describe("Net increase for this period"),
  closing_cash_balance: z
    .number()
    .describe("Balance at the end of this period"),
});

const StatementOfCashFlowsSchema = z.object({
  currency: z.string(),
  periods: z.array(CashFlowPeriodSchema).describe("List of cash flow periods"),
});

const FinancialReportSchema = z.object({
  entity_name: z.string().describe("Name of the reporting entity"),
  report_title: z.string().optional(),
  reporting_period_start: z.string().optional(),
  reporting_period_end: z.string().optional(),
  reporting_as_of_date: z.string().optional(),
  currency: z.string(),
  statement_of_financial_position:
    StatementOfFinancialPositionSchema.optional(),
  statement_of_profit_or_loss: StatementOfProfitAndLossSchema.optional(),
  statement_of_comprehensive_income:
    StatementOfComprehensiveIncomeSchema.optional(),
  statement_of_changes_in_equity: StatementOfChangesInEquitySchema.optional(),
  statement_of_cash_flows: StatementOfCashFlowsSchema.optional(),
});
```
### Process
```python Python theme={"system"}
# Convert Pydantic model to JSON schema
schema = FinancialReport.model_json_schema()

client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])

# Create extract task
task = client.tasks.extract.create(
    file="https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/financial_report.pdf",
    schema=schema,
)
```
```typescript TypeScript theme={"system"}
// Convert Zod schema to JSON schema
const schema = z.toJSONSchema(FinancialReportSchema);

const client = new Chunkr({ apiKey: process.env.CHUNKR_API_KEY });

// Create extract task
const task = client.tasks.extract.create({
  file: "https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/financial_report.pdf",
  schema: schema,
});
```
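Once the extracted result is available as a plain dictionary matching the schema, it can be post-processed with ordinary Python. The sketch below totals cash flow items per `activity_type` for one period. How the result is retrieved from the task is not shown here, the helper name is hypothetical, and the sample period is a three-item excerpt of the output in the next section, so its totals are not the full statement's totals.

```python
from collections import defaultdict
from typing import Any, Dict


def totals_by_activity(period: Dict[str, Any]) -> Dict[str, float]:
    """Sum the `amount` of every cash flow item, grouped by `activity_type`."""
    totals: Dict[str, float] = defaultdict(float)
    for item in period["items"]:
        totals[item["activity_type"]] += item["amount"]
    return dict(totals)


# Excerpt shaped like one entry of statement_of_cash_flows.periods.
sample_period = {
    "period_end": "2024-03-31",
    "items": [
        {"account_name": "Net income",
         "activity_type": "Operating Activities", "amount": 209217},
        {"account_name": "Depreciation and amortization",
         "activity_type": "Operating Activities", "amount": 858620},
        {"account_name": "Payments for acquisition of investments",
         "activity_type": "Investing Activities", "amount": -800925},
    ],
}

print(totals_by_activity(sample_period))
```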
### Output
```json Result expandable theme={"system"}
{
"currency": "JPY",
"entity_name": "SoftBank Group Corp.",
"report_title": "Consolidated Financial Report",
"reporting_as_of_date": "2025-03-31",
"reporting_period_end": "2025-03-31",
"reporting_period_start": "2024-04-01",
"statement_of_cash_flows": {
"currency": "JPY",
"periods": [
{
"closing_cash_balance": 6186874,
"items": [
{
"account_name": "Net income",
"activity_type": "Operating Activities",
"amount": 209217
},
{
"account_name": "Depreciation and amortization",
"activity_type": "Operating Activities",
"amount": 858620
},
{
"account_name": "Loss (gain) on investments at Investment Business of Holding Companies",
"activity_type": "Operating Activities",
"amount": 449817
},
{
"account_name": "Loss (gain) on investments at SoftBank Vision Funds",
"activity_type": "Operating Activities",
"amount": 167290
},
{
"account_name": "Finance cost",
"activity_type": "Operating Activities",
"amount": 556004
},
{
"account_name": "Foreign exchange loss (gain)",
"activity_type": "Operating Activities",
"amount": 703122
},
{
"account_name": "Derivative (gain) loss (excluding (gain) loss on investments)",
"activity_type": "Operating Activities",
"amount": -1502326
},
{
"account_name": "Change in third-party interests in SVF",
"activity_type": "Operating Activities",
"amount": 390137
},
{
"account_name": "(Gain) loss on other investments and other gain",
"activity_type": "Operating Activities",
"amount": -271064
},
{
"account_name": "Income taxes",
"activity_type": "Operating Activities",
"amount": -151416
},
{
"account_name": "Increase in investments from asset management subsidiaries",
"activity_type": "Operating Activities",
"amount": -230986
},
{
"account_name": "Increase in trade and other receivables",
"activity_type": "Operating Activities",
"amount": -476511
},
{
"account_name": "Decrease (increase) in inventories",
"activity_type": "Operating Activities",
"amount": 5436
},
{
"account_name": "Increase in trade and other payables",
"activity_type": "Operating Activities",
"amount": 325731
},
{
"account_name": "Other",
"activity_type": "Operating Activities",
"amount": 208593
},
{
"account_name": "Interest and dividends received",
"activity_type": "Operating Activities",
"amount": 256083
},
{
"account_name": "Interest paid",
"activity_type": "Operating Activities",
"amount": -430422
},
{
"account_name": "Income taxes paid",
"activity_type": "Operating Activities",
"amount": -885617
},
{
"account_name": "Income taxes refunded",
"activity_type": "Operating Activities",
"amount": 68839
},
{
"account_name": "Payments for acquisition of investments",
"activity_type": "Investing Activities",
"amount": -800925
},
{
"account_name": "Proceeds from sales/redemption of investments",
"activity_type": "Investing Activities",
"amount": 219668
},
{
"account_name": "Payments for acquisition of investments by SVF",
"activity_type": "Investing Activities",
"amount": -212045
},
{
"account_name": "Proceeds from sales of investments by SVF",
"activity_type": "Investing Activities",
"amount": 922020
},
{
"account_name": "Payments for acquisition of investments by asset management subsidiaries",
"activity_type": "Investing Activities",
"amount": -76877
},
{
"account_name": "Payments (net) for acquisition of control over subsidiaries",
"activity_type": "Investing Activities",
"amount": -104484
},
{
"account_name": "Proceeds (net) from loss of control over subsidiaries",
"activity_type": "Investing Activities",
"amount": 96755
},
{
"account_name": "Purchase of property, plant and equipment, and intangible assets",
"activity_type": "Investing Activities",
"amount": -622612
},
{
"account_name": "Payments for loan receivables",
"activity_type": "Investing Activities",
"amount": -313686
},
{
"account_name": "Collection of loan receivables",
"activity_type": "Investing Activities",
"amount": 107481
},
{
"account_name": "Payments into time deposits",
"activity_type": "Investing Activities",
"amount": -148657
},
{
"account_name": "Proceeds from withdrawal of time deposits",
"activity_type": "Investing Activities",
"amount": 77954
},
{
"account_name": "Other",
"activity_type": "Investing Activities",
"amount": 13947
},
{
"account_name": "Proceeds in (repayment of) short-term interest-bearing debt, net",
"activity_type": "Financing Activities",
"amount": 202074
},
{
"account_name": "Proceeds from interest-bearing debt",
"activity_type": "Financing Activities",
"amount": 5181190
},
{
"account_name": "Repayment of interest-bearing debt",
"activity_type": "Financing Activities",
"amount": -5175486
},
{
"account_name": "Repayment of lease liabilities",
"activity_type": "Financing Activities",
"amount": -211231
},
{
"account_name": "Distribution/repayment from SVF to third-party investors",
"activity_type": "Financing Activities",
"amount": -783522
},
{
"account_name": "Proceeds from the partial sales of shares of subsidiaries to non-controlling interests",
"activity_type": "Financing Activities",
"amount": 747565
},
{
"account_name": "Purchase of shares of subsidiaries from non-controlling interests",
"activity_type": "Financing Activities",
"amount": -112009
},
{
"account_name": "Redemption of other equity instruments",
"activity_type": "Financing Activities",
"amount": -277760
},
{
"account_name": "Distribution to owners of other equity instruments",
"activity_type": "Financing Activities",
"amount": -25624
},
{
"account_name": "Proceeds from the issuance of other equity instruments in subsidiaries",
"activity_type": "Financing Activities",
"amount": 120000
},
{
"account_name": "Purchase of treasury stock",
"activity_type": "Financing Activities",
"amount": -8
},
{
"account_name": "Cash dividends paid",
"activity_type": "Financing Activities",
"amount": -64356
},
{
"account_name": "Cash dividends paid to non-controlling interests",
"activity_type": "Financing Activities",
"amount": -288119
},
{
"account_name": "Other",
"activity_type": "Financing Activities",
"amount": 81064
},
{
"account_name": "Effect of exchange rate changes on cash and cash equivalents",
"activity_type": "Other",
"amount": 491868
},
{
"account_name": "(Decrease) increase in cash and cash equivalents relating to transfer of assets classified as held for sale",
"activity_type": "Other",
"amount": -33011
}
],
"net_increase_in_cash": -738279,
"period_end": "2024-03-31"
},
{
"closing_cash_balance": 3713028,
"items": [
{
"account_name": "Net income",
"activity_type": "Operating Activities",
"amount": 1603108
},
{
"account_name": "Depreciation and amortization",
"activity_type": "Operating Activities",
"amount": 866823
},
{
"account_name": "Loss (gain) on investments at Investment Business of Holding Companies",
"activity_type": "Operating Activities",
"amount": -3422188
},
{
"account_name": "Loss (gain) on investments at SoftBank Vision Funds",
"activity_type": "Operating Activities",
"amount": -387584
},
{
"account_name": "Finance cost",
"activity_type": "Operating Activities",
"amount": 581559
},
{
"account_name": "Foreign exchange loss (gain)",
"activity_type": "Operating Activities",
"amount": -27055
},
{
"account_name": "Derivative (gain) loss (excluding (gain) loss on investments)",
"activity_type": "Operating Activities",
"amount": 2034029
},
{
"account_name": "Change in third-party interests in SVF",
"activity_type": "Operating Activities",
"amount": 491898
},
{
"account_name": "(Gain) loss on other investments and other gain",
"activity_type": "Operating Activities",
"amount": -253953
},
{
"account_name": "Income taxes",
"activity_type": "Operating Activities",
"amount": 101613
},
{
"account_name": "Increase in investments from asset management subsidiaries",
"activity_type": "Operating Activities",
"amount": -769572
},
{
"account_name": "Increase in trade and other receivables",
"activity_type": "Operating Activities",
"amount": -508544
},
{
"account_name": "Decrease (increase) in inventories",
"activity_type": "Operating Activities",
"amount": -40000
},
{
"account_name": "Increase in trade and other payables",
"activity_type": "Operating Activities",
"amount": 237030
},
{
"account_name": "Other",
"activity_type": "Operating Activities",
"amount": 93974
},
{
"account_name": "Interest and dividends received",
"activity_type": "Operating Activities",
"amount": 299714
},
{
"account_name": "Interest paid",
"activity_type": "Operating Activities",
"amount": -482111
},
{
"account_name": "Income taxes paid",
"activity_type": "Operating Activities",
"amount": -380008
},
{
"account_name": "Income taxes refunded",
"activity_type": "Operating Activities",
"amount": 164847
},
{
"account_name": "Payments for acquisition of investments",
"activity_type": "Investing Activities",
"amount": -1625245
},
{
"account_name": "Proceeds from sales/redemption of investments",
"activity_type": "Investing Activities",
"amount": 1180746
},
{
"account_name": "Payments for acquisition of investments by SVF",
"activity_type": "Investing Activities",
"amount": -578927
},
{
"account_name": "Proceeds from sales of investments by SVF",
"activity_type": "Investing Activities",
"amount": 458319
},
{
"account_name": "Payments for acquisition of investments by asset management subsidiaries",
"activity_type": "Investing Activities",
"amount": 0
},
{
"account_name": "Payments (net) for acquisition of control over subsidiaries",
"activity_type": "Investing Activities",
"amount": -194216
},
{
"account_name": "Proceeds (net) from loss of control over subsidiaries",
"activity_type": "Investing Activities",
"amount": 94862
},
{
"account_name": "Purchase of property, plant and equipment, and intangible assets",
"activity_type": "Investing Activities",
"amount": -854173
},
{
"account_name": "Payments for loan receivables",
"activity_type": "Investing Activities",
"amount": -36538
},
{
"account_name": "Collection of loan receivables",
"activity_type": "Investing Activities",
"amount": 119384
},
{
"account_name": "Payments into time deposits",
"activity_type": "Investing Activities",
"amount": -139211
},
{
"account_name": "Proceeds from withdrawal of time deposits",
"activity_type": "Investing Activities",
"amount": 166897
},
{
"account_name": "Other",
"activity_type": "Investing Activities",
"amount": -223438
},
{
"account_name": "Proceeds in (repayment of) short-term interest-bearing debt, net",
"activity_type": "Financing Activities",
"amount": -421723
},
{
"account_name": "Proceeds from interest-bearing debt",
"activity_type": "Financing Activities",
"amount": 5313665
},
{
"account_name": "Repayment of interest-bearing debt",
"activity_type": "Financing Activities",
"amount": -3809082
},
{
"account_name": "Repayment of lease liabilities",
"activity_type": "Financing Activities",
"amount": -186441
},
{
"account_name": "Distribution/repayment from SVF to third-party investors",
"activity_type": "Financing Activities",
"amount": -1485774
},
{
"account_name": "Proceeds from the partial sales of shares of subsidiaries to non-controlling interests",
"activity_type": "Financing Activities",
"amount": 0
},
{
"account_name": "Purchase of shares of subsidiaries from non-controlling interests",
"activity_type": "Financing Activities",
"amount": -79581
},
{
"account_name": "Redemption of other equity instruments",
"activity_type": "Financing Activities",
"amount": 0
},
{
"account_name": "Distribution to owners of other equity instruments",
"activity_type": "Financing Activities",
"amount": -18867
},
{
"account_name": "Proceeds from the issuance of other equity instruments in subsidiaries",
"activity_type": "Financing Activities",
"amount": 200000
},
{
"account_name": "Purchase of treasury stock",
"activity_type": "Financing Activities",
"amount": -237058
},
{
"account_name": "Cash dividends paid",
"activity_type": "Financing Activities",
"amount": -64020
},
{
"account_name": "Cash dividends paid to non-controlling interests",
"activity_type": "Financing Activities",
"amount": -368678
},
{
"account_name": "Other",
"activity_type": "Financing Activities",
"amount": 41175
},
{
"account_name": "Effect of exchange rate changes on cash and cash equivalents",
"activity_type": "Other",
"amount": 37487
},
{
"account_name": "(Decrease) increase in cash and cash equivalents relating to transfer of assets classified as held for sale",
"activity_type": "Other",
"amount": 33011
}
],
"net_increase_in_cash": -2473846,
"period_end": "2025-03-31"
}
]
},
"statement_of_changes_in_equity": {
"currency": "JPY",
"items": [
{
"changes": [],
"closing_balance": 238772,
"component": "Common stock",
"opening_balance": 238772
},
{
"changes": [
"Changes in interests in subsidiaries",
"Share-based payment transactions",
"Other"
],
"closing_balance": 3376724,
"component": "Capital surplus",
"opening_balance": 3326093
},
{
"changes": [],
"closing_balance": 193199,
"component": "Other equity instruments",
"opening_balance": 193199
},
{
"changes": [
"Net income",
"Cash dividends",
"Distribution to owners of other equity instruments",
"Transfer of accumulated other comprehensive income to retained earnings",
"Purchase and disposal of treasury stock"
],
"closing_balance": 2701792,
"component": "Retained earnings",
"opening_balance": 1632966
},
{
"changes": [
"Purchase and disposal of treasury stock"
],
"closing_balance": -256251,
"component": "Treasury stock",
"opening_balance": -22725
},
{
"changes": [
"Other comprehensive income",
"Transfer of accumulated other comprehensive income to retained earnings"
],
"closing_balance": 5307305,
"component": "Accumulated other comprehensive income",
"opening_balance": 5793820
}
],
"period_end": "2025-03-31"
},
"statement_of_comprehensive_income": {
"currency": "JPY",
"periods": [
{
"items": [
{
"account_name": "Net income",
"amount": 209217
},
{
"account_name": "Remeasurements of defined benefit plan",
"amount": -308
},
{
"account_name": "Equity financial assets at FVTOCI",
"amount": 10777
},
{
"account_name": "Share of other comprehensive income of associates",
"amount": 326
},
{
"account_name": "Debt financial assets at FVTOCI",
"amount": -286
},
{
"account_name": "Cash flow hedges",
"amount": 24007
},
{
"account_name": "Exchange differences on translating foreign operations",
"amount": 2000916
},
{
"account_name": "Share of other comprehensive income of associates",
"amount": -3208
}
],
"period_end": "2024-03-31",
"total_comprehensive_income": 2241441
},
{
"items": [
{
"account_name": "Net income",
"amount": 1603108
},
{
"account_name": "Remeasurements of defined benefit plan",
"amount": 2598
},
{
"account_name": "Equity financial assets at FVTOCI",
"amount": -13757
},
{
"account_name": "Share of other comprehensive income of associates",
"amount": 162
},
{
"account_name": "Debt financial assets at FVTOCI",
"amount": -2373
},
{
"account_name": "Cash flow hedges",
"amount": 42263
},
{
"account_name": "Exchange differences on translating foreign operations",
"amount": -547774
},
{
"account_name": "Share of other comprehensive income of associates",
"amount": -1879
}
],
"period_end": "2025-03-31",
"total_comprehensive_income": 1082348
}
]
},
"statement_of_financial_position": {
"currency": "JPY",
"periods": [
{
"as_of_date": "2024-03-31",
"items": [
{
"account_name": "Cash and cash equivalents",
"amount": 6186874,
"category": "Assets",
"subcategory": "Current assets"
},
{
"account_name": "Trade and other receivables",
"amount": 2868767,
"category": "Assets",
"subcategory": "Current assets"
},
{
"account_name": "Derivative financial assets",
"amount": 852350,
"category": "Assets",
"subcategory": "Current assets"
},
{
"account_name": "Other financial assets",
"amount": 777996,
"category": "Assets",
"subcategory": "Current assets"
},
{
"account_name": "Inventories",
"amount": 161863,
"category": "Assets",
"subcategory": "Current assets"
},
{
"account_name": "Other current assets",
"amount": 550984,
"category": "Assets",
"subcategory": "Current assets"
},
{
"account_name": "Assets classified as held for sale",
"amount": 42559,
"category": "Assets",
"subcategory": "Current assets"
},
{
"account_name": "Property, plant and equipment",
"amount": 1895289,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Right-of-use assets",
"amount": 746903,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Goodwill",
"amount": 5709874,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Intangible assets",
"amount": 2448840,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Costs to obtain contracts",
"amount": 317650,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Investments accounted for using the equity method",
"amount": 839208,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Investments from SVF (FVTPL)",
"amount": 11014487,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Investment securities",
"amount": 9061972,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Derivative financial assets",
"amount": 385528,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Other financial assets",
"amount": 2424282,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Deferred tax assets",
"amount": 245954,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Other non-current assets",
"amount": 192863,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Interest-bearing debt",
"amount": 8271143,
"category": "Liabilities",
"subcategory": "Current liabilities"
},
{
"account_name": "Lease liabilities",
"amount": 149801,
"category": "Liabilities",
"subcategory": "Current liabilities"
},
{
"account_name": "Deposits for banking business",
"amount": 1643155,
"category": "Liabilities",
"subcategory": "Current liabilities"
},
{
"account_name": "Trade and other payables",
"amount": 2710529,
"category": "Liabilities",
"subcategory": "Current liabilities"
},
{
"account_name": "Derivative financial liabilities",
"amount": 195090,
"category": "Liabilities",
"subcategory": "Current liabilities"
},
{
"account_name": "Other financial liabilities",
"amount": 31801,
"category": "Liabilities",
"subcategory": "Current liabilities"
},
{
"account_name": "Income taxes payable",
"amount": 163226,
"category": "Liabilities",
"subcategory": "Current liabilities"
},
{
"account_name": "Provisions",
"amount": 44704,
"category": "Liabilities",
"subcategory": "Current liabilities"
},
{
"account_name": "Other current liabilities",
"amount": 801285,
"category": "Liabilities",
"subcategory": "Current liabilities"
},
{
"account_name": "Liabilities directly relating to assets classified as held for sale",
"amount": 9561,
"category": "Liabilities",
"subcategory": "Current liabilities"
},
{
"account_name": "Interest-bearing debt",
"amount": 12296381,
"category": "Liabilities",
"subcategory": "Non-current liabilities"
},
{
"account_name": "Lease liabilities",
"amount": 644706,
"category": "Liabilities",
"subcategory": "Non-current liabilities"
},
{
"account_name": "Third-party interests in SVF",
"amount": 4694503,
"category": "Liabilities",
"subcategory": "Non-current liabilities"
},
{
"account_name": "Derivative financial liabilities",
"amount": 41238,
"category": "Liabilities",
"subcategory": "Non-current liabilities"
},
{
"account_name": "Other financial liabilities",
"amount": 57017,
"category": "Liabilities",
"subcategory": "Non-current liabilities"
},
{
"account_name": "Provisions",
"amount": 167902,
"category": "Liabilities",
"subcategory": "Non-current liabilities"
},
{
"account_name": "Deferred tax liabilities",
"amount": 1253039,
"category": "Liabilities",
"subcategory": "Non-current liabilities"
},
{
"account_name": "Other non-current liabilities",
"amount": 311993,
"category": "Liabilities",
"subcategory": "Non-current liabilities"
},
{
"account_name": "Common stock",
"amount": 238772,
"category": "Equity",
"subcategory": "Equity attributable to owners of the parent"
},
{
"account_name": "Capital surplus",
"amount": 3326093,
"category": "Equity",
"subcategory": "Equity attributable to owners of the parent"
},
{
"account_name": "Other equity instruments",
"amount": 193199,
"category": "Equity",
"subcategory": "Equity attributable to owners of the parent"
},
{
"account_name": "Retained earnings",
"amount": 1632966,
"category": "Equity",
"subcategory": "Equity attributable to owners of the parent"
},
{
"account_name": "Treasury stock",
"amount": -22725,
"category": "Equity",
"subcategory": "Equity attributable to owners of the parent"
},
{
"account_name": "Accumulated other comprehensive income",
"amount": 5793820,
"category": "Equity",
"subcategory": "Equity attributable to owners of the parent"
},
{
"account_name": "Non-controlling interests",
"amount": 2075044,
"category": "Equity",
"subcategory": null
}
]
},
{
"as_of_date": "2025-03-31",
"items": [
{
"account_name": "Cash and cash equivalents",
"amount": 3713028,
"category": "Assets",
"subcategory": "Current assets"
},
{
"account_name": "Trade and other receivables",
"amount": 3008144,
"category": "Assets",
"subcategory": "Current assets"
},
{
"account_name": "Derivative financial assets",
"amount": 111258,
"category": "Assets",
"subcategory": "Current assets"
},
{
"account_name": "Other financial assets",
"amount": 1485877,
"category": "Assets",
"subcategory": "Current assets"
},
{
"account_name": "Inventories",
"amount": 198291,
"category": "Assets",
"subcategory": "Current assets"
},
{
"account_name": "Other current assets",
"amount": 365880,
"category": "Assets",
"subcategory": "Current assets"
},
{
"account_name": "Assets classified as held for sale",
"amount": 550440,
"category": "Assets",
"subcategory": "Current assets"
},
{
"account_name": "Property, plant and equipment",
"amount": 2830185,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Right-of-use assets",
"amount": 857961,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Goodwill",
"amount": 5781931,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Intangible assets",
"amount": 2414562,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Costs to obtain contracts",
"amount": 383022,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Investments accounted for using the equity method",
"amount": 502995,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Investments from SVF (FVTPL)",
"amount": 11410922,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Investment securities",
"amount": 8040068,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Derivative financial assets",
"amount": 168248,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Other financial assets",
"amount": 2767625,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Deferred tax assets",
"amount": 207987,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Other non-current assets",
"amount": 215332,
"category": "Assets",
"subcategory": "Non-current assets"
},
{
"account_name": "Interest-bearing debt",
"amount": 5629648,
"category": "Liabilities",
"subcategory": "Current liabilities"
},
{
"account_name": "Lease liabilities",
"amount": 165355,
"category": "Liabilities",
"subcategory": "Current liabilities"
},
{
"account_name": "Deposits for banking business",
"amount": 1795965,
"category": "Liabilities",
"subcategory": "Current liabilities"
},
{
"account_name": "Trade and other payables",
"amount": 3036349,
"category": "Liabilities",
"subcategory": "Current liabilities"
},
{
"account_name": "Derivative financial liabilities",
"amount": 840469,
"category": "Liabilities",
"subcategory": "Current liabilities"
},
{
"account_name": "Other financial liabilities",
"amount": 5940,
"category": "Liabilities",
"subcategory": "Current liabilities"
},
{
"account_name": "Income taxes payable",
"amount": 444180,
"category": "Liabilities",
"subcategory": "Current liabilities"
},
{
"account_name": "Provisions",
"amount": 54047,
"category": "Liabilities",
"subcategory": "Current liabilities"
},
{
"account_name": "Other current liabilities",
"amount": 629717,
"category": "Liabilities",
"subcategory": "Current liabilities"
},
{
"account_name": "Liabilities directly relating to assets classified as held for sale",
"amount": 0,
"category": "Liabilities",
"subcategory": "Current liabilities"
},
{
"account_name": "Interest-bearing debt",
"amount": 12376682,
"category": "Liabilities",
"subcategory": "Non-current liabilities"
},
{
"account_name": "Lease liabilities",
"amount": 741665,
"category": "Liabilities",
"subcategory": "Non-current liabilities"
},
{
"account_name": "Third-party interests in SVF",
"amount": 3652797,
"category": "Liabilities",
"subcategory": "Non-current liabilities"
},
{
"account_name": "Derivative financial liabilities",
"amount": 104197,
"category": "Liabilities",
"subcategory": "Non-current liabilities"
},
{
"account_name": "Other financial liabilities",
"amount": 199284,
"category": "Liabilities",
"subcategory": "Non-current liabilities"
},
{
"account_name": "Provisions",
"amount": 155436,
"category": "Liabilities",
"subcategory": "Non-current liabilities"
},
{
"account_name": "Deferred tax liabilities",
"amount": 924392,
"category": "Liabilities",
"subcategory": "Non-current liabilities"
},
{
"account_name": "Other non-current liabilities",
"amount": 304607,
"category": "Liabilities",
"subcategory": "Non-current liabilities"
},
{
"account_name": "Common stock",
"amount": 238772,
"category": "Equity",
"subcategory": "Equity attributable to owners of the parent"
},
{
"account_name": "Capital surplus",
"amount": 3376724,
"category": "Equity",
"subcategory": "Equity attributable to owners of the parent"
},
{
"account_name": "Other equity instruments",
"amount": 193199,
"category": "Equity",
"subcategory": "Equity attributable to owners of the parent"
},
{
"account_name": "Retained earnings",
"amount": 2701792,
"category": "Equity",
"subcategory": "Equity attributable to owners of the parent"
},
{
"account_name": "Treasury stock",
"amount": -256251,
"category": "Equity",
"subcategory": "Equity attributable to owners of the parent"
},
{
"account_name": "Accumulated other comprehensive income",
"amount": 5307305,
"category": "Equity",
"subcategory": "Equity attributable to owners of the parent"
},
{
"account_name": "Non-controlling interests",
"amount": 2391485,
"category": "Equity",
"subcategory": null
}
]
}
]
},
"statement_of_profit_or_loss": {
"currency": "JPY",
"items": [
{
"account_name": "Net sales",
"amount": 7243752
},
{
"account_name": "Cost of sales",
"amount": -3489549
},
{
"account_name": "Gross profit",
"amount": 3754203
},
{
"account_name": "Gain (loss) on investments at Investment Business of Holding Companies",
"amount": 3413821
},
{
"account_name": "Gain (loss) on investments at SoftBank Vision Funds",
"amount": 387584
},
{
"account_name": "Gain (loss) on other investments",
"amount": -100298
},
{
"account_name": "Total gain on investments",
"amount": 3701107
},
{
"account_name": "Selling, general and administrative expenses",
"amount": -3024409
},
{
"account_name": "Finance cost",
"amount": -581559
},
{
"account_name": "Foreign exchange gain (loss)",
"amount": 27055
},
{
"account_name": "Derivative gain (loss) (excluding gain (loss) on investments)",
"amount": -2034029
},
{
"account_name": "Change in third-party interests in SVF",
"amount": -491898
},
{
"account_name": "Other gain",
"amount": 354251
},
{
"account_name": "Income before income tax",
"amount": 1704721
},
{
"account_name": "Income taxes",
"amount": -101613
}
],
"net_income": 1603108,
"period_end": "2025-03-31",
"period_start": "2024-04-01"
}
}
```
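As a quick sanity check, figures in a result like the one above can be reconciled against each other. The sketch below is illustrative only: it copies a handful of values from the Result by hand (the variable names are hypothetical, not part of the Chunkr SDK) and checks that the equity lines attributable to owners of the parent add up, and that income before income tax plus income taxes equals `net_income`.

```python
# Values copied by hand from the Result above (millions of yen, as of 2025-03-31).
equity_attributable_to_owners = {
    "Common stock": 238772,
    "Capital surplus": 3376724,
    "Other equity instruments": 193199,
    "Retained earnings": 2701792,
    "Treasury stock": -256251,
    "Accumulated other comprehensive income": 5307305,
}
profit_or_loss = {
    "Income before income tax": 1704721,
    "Income taxes": -101613,  # expenses are extracted as negative amounts
    "net_income": 1603108,
}

# Sum of the six equity lines attributable to owners of the parent.
total_equity_owners = sum(equity_attributable_to_owners.values())

# Net income should equal pre-tax income plus the (negative) tax line.
derived_net_income = (
    profit_or_loss["Income before income tax"] + profit_or_loss["Income taxes"]
)

print(total_equity_owners)  # 11561541
print(derived_net_income == profit_or_loss["net_income"])  # True
```

The equity total of 11,561,541 matches the "As of March 31, 2025" total in the statement of changes in equity table quoted in the Citations output, which suggests the extracted line items are internally consistent.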
```json Citations expandable theme={"system"}
{
"currency": [
{
"bboxes": [
{
"height": 20.03041076660156,
"left": 955.2528076171876,
"top": 156.55679321289062,
"width": 122.904052734375
}
],
"citation_id": "smyIOGl",
"citation_type": "Segment",
"content": "(Millions yen) of",
"page_height": 1684,
"page_number": 1,
"page_width": 1190,
"segment_id": "msfcd8k",
"segment_type": "Caption"
},
{
"bboxes": [
{
"height": 19.39678955078125,
"left": 1043.5679931640625,
"top": 156.64320373535156,
"width": 34.5888671875
}
],
"citation_id": "P41EBY9",
"citation_type": "Word",
"content": "yen)",
"page_height": 1684,
"page_number": 1,
"page_width": 1190,
"segment_type": "Text"
},
// .. more citations
],
"entity_name": [
{
"bboxes": [
{
"height": 34.08674621582031,
"left": 793.8571166992188,
"top": 50.41603088378906,
"width": 343.96636962890625
}
],
"citation_id": "_jotB2i",
"citation_type": "Segment",
"content": "SoftBank Group Corp. Consolidated Financial Report\nFor the Fiscal Year Ended March 31, 2025",
"page_height": 1684,
"page_number": 1,
"page_width": 1190,
"segment_id": "Amt340p",
"segment_type": "PageHeader"
},
// .. more citations
],
"report_title": [
{
"bboxes": [
{
"height": 34.08674621582031,
"left": 793.8571166992188,
"top": 50.41603088378906,
"width": 343.96636962890625
}
],
"citation_id": "9E7BaA8",
"citation_type": "Segment",
"content": "SoftBank Group Corp. Consolidated Financial Report\nFor the Fiscal Year Ended March 31, 2025",
"page_height": 1684,
"page_number": 1,
"page_width": 1190,
"segment_id": "Amt340p",
"segment_type": "PageHeader"
},
// .. more citations
],
"reporting_as_of_date": [
{
"bboxes": [
{
"height": 34.08674621582031,
"left": 793.8571166992188,
"top": 50.41603088378906,
"width": 343.96636962890625
}
],
"citation_id": "1pR8imf",
"citation_type": "Segment",
"content": "SoftBank Group Corp. Consolidated Financial Report\nFor the Fiscal Year Ended March 31, 2025",
"page_height": 1684,
"page_number": 1,
"page_width": 1190,
"segment_id": "Amt340p",
"segment_type": "PageHeader"
},
// .. more citations
],
"reporting_period_end": [
{
"bboxes": [
{
"height": 34.08674621582031,
"left": 793.8571166992188,
"top": 50.41603088378906,
"width": 343.96636962890625
}
],
"citation_id": "S7ocr4q",
"citation_type": "Segment",
"content": "SoftBank Group Corp. Consolidated Financial Report\nFor the Fiscal Year Ended March 31, 2025",
"page_height": 1684,
"page_number": 1,
"page_width": 1190,
"segment_id": "Amt340p",
"segment_type": "PageHeader"
},
// .. more citations
],
"reporting_period_start": [
{
"bboxes": [
{
"height": 854.3191528320312,
"left": 119.18824005126952,
"top": 209.4170684814453,
"width": 973.8115234375
}
],
"citation_id": "HiQ_2nI",
"citation_type": "Segment",
"content": "| | Common stock | Capital surplus | Other equity instruments | Retained earnings | Treasury stock | Accumulated other comprehensive income | Total |\n| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |\n| As of April 1, 2024 | 238,772 | 3,326,093 | 193,199 | 1,632,966 | (22,725) | 5,793,820 | 11,162,125 |\n| Comprehensive income | | | | | | | |\n| Net income | - | - | - | 1,153,332 | - | - | 1,153,332 |\n| Other comprehensive income | - | - | - | - | - | (487,095) | (487,095) |\n| Total comprehensive income | - | - | - | 1,153,332 | - | (487,095) | 666,237 |\n| Transactions with owners and other transactions | | | | | | | |\n| Cash dividends | - | - | - | (64,086) | - | - | (64,086) |\n| Distribution to owners of other equity instruments | - | - | - | (18,867) | - | - | (18,867) |\n| Transfer of accumulated other comprehensive income to retained earnings | - | - | - | (580) | - | 580 | - |\n| Purchase and disposal of treasury stock | - | - | - | (973) | (233,526) | - | (234,499) |\n| Changes from loss of control | - | - | - | - | - | - | - |\n| Changes in interests in subsidiaries | - | 49,732 | - | - | - | - | 49,732 |\n| Issuance of other equity instruments in subsidiaries | - | - | - | - | - | - | - |\n| Share-based payment transactions | - | (1,049) | - | - | - | - | (1,049) |\n| Other | - | 1,948 | - | - | - | - | 1,948 |\n| Total transactions with owners and other transactions | - | 50,631 | - | (84,506) | (233,526) | 580 | (266,821) |\n| As of March 31, 2025 | 238,772 | 3,376,724 | 193,199 | 2,701,792 | (256,251) | 5,307,305 | 11,561,541 |\n",
"page_height": 1684,
"page_number": 7,
"page_width": 1190,
"segment_id": "N2J_j_e",
"segment_type": "Table"
},
// .. more citations
],
"statement_of_cash_flows": {
"currency": [
{
"bboxes": [
{
"height": 17.33056640625,
"left": 932.5235595703124,
"top": 158.1707000732422,
"width": 122.353271484375
}
],
"citation_id": "IIQwBAe",
"citation_type": "Segment",
"content": "(Millions of yen)",
"page_height": 1684,
"page_number": 9,
"page_width": 1190,
"segment_id": "cPeI4-1",
"segment_type": "Text"
},
// .. more citations
],
// .. more items
},
// .. more statements
}
```
```json Metrics expandable theme={"system"}
{
"currency": {
"citation_status": "Created",
"confidence": "High"
},
"entity_name": {
"citation_status": "Created",
"confidence": "High"
},
"report_title": {
"citation_status": "Created",
"confidence": "High"
},
"reporting_as_of_date": {
"citation_status": "Created",
"confidence": "High"
},
"reporting_period_end": {
"citation_status": "Created",
"confidence": "High"
},
"reporting_period_start": {
"citation_status": "Created",
"confidence": "High"
},
"statement_of_cash_flows": {
"currency": {
"citation_status": "Created",
"confidence": "High"
},
"periods": [
{
"closing_cash_balance": {
"citation_status": "Created",
"confidence": "High"
},
"items": [
{
"account_name": {
"citation_status": "Created",
"confidence": "High"
}
}
// .. more items
]
}
// .. more periods
]
}
// .. more metrics
}
```
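The metrics object mirrors the shape of the extracted result, so one way to triage an extraction is to walk it recursively and collect the paths of fields that were not extracted with high confidence. The sketch below is a minimal illustration, not part of the Chunkr SDK: `iter_metrics` is a hypothetical helper, it assumes every leaf metric is an object with a `confidence` key (as in the sample above), and the `"Low"` value is invented for demonstration since the sample only shows `"High"`.

```python
from typing import Any, Iterator, Tuple


def iter_metrics(node: Any, path: str = "") -> Iterator[Tuple[str, dict]]:
    """Yield (field_path, metric) for every object that carries a confidence value."""
    if isinstance(node, dict):
        if "confidence" in node:
            # Assumed leaf shape: {"citation_status": ..., "confidence": ...}
            yield path, node
            return
        for key, value in node.items():
            yield from iter_metrics(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_metrics(value, f"{path}[{index}]")


# Trimmed-down copy of the Metrics sample above.
metrics = {
    "currency": {"citation_status": "Created", "confidence": "High"},
    "statement_of_cash_flows": {
        "periods": [
            {
                "closing_cash_balance": {"citation_status": "Created", "confidence": "High"},
                "items": [
                    # Hypothetical value: changed from "High" to illustrate filtering.
                    {"account_name": {"citation_status": "Created", "confidence": "Low"}},
                ],
            }
        ],
    },
}

needs_review = [
    path for path, metric in iter_metrics(metrics) if metric["confidence"] != "High"
]
print(needs_review)
# ['statement_of_cash_flows.periods[0].items[0].account_name']
```

Because the paths follow the result's own structure, the same path can be used to look up the corresponding value and its citations. A schema that itself contains a field named `confidence` would need a stricter leaf test than the one assumed here.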
## Medical Benefits Claim
Extract key fields from a medical benefits claim form, including employee, patient, physician, diagnosis, procedure, and billing details.
### Schema
```python Python theme={"system"}
import os
from datetime import date
from decimal import Decimal
from typing import List, Optional

from chunkr_ai import Chunkr
from pydantic import BaseModel


class Address(BaseModel):
    line1: Optional[str]
    line2: Optional[str]
    city: Optional[str]
    state: Optional[str]
    country: Optional[str]
    postal_code: Optional[str]
    raw: Optional[str]


class EmployeeInfo(BaseModel):
    employer_name: Optional[str]
    policy_group_number: Optional[str]
    aetna_id: Optional[str]
    full_name: Optional[str]
    birthdate: Optional[date]
    employment_status: Optional[str]
    date_of_retirement: Optional[date]
    address: Optional[Address]
    phone: Optional[str]


class PatientInfo(BaseModel):
    full_name: Optional[str]
    aetna_id: Optional[str]
    birthdate: Optional[date]
    relationship_to_employee: Optional[str]
    address: Optional[Address]
    gender: Optional[str]
    marital_status: Optional[str]
    employed: Optional[bool]
    employer_name: Optional[str]
    employer_address: Optional[Address]


class ClaimCircumstances(BaseModel):
    accident_related: Optional[bool]
    accident_date: Optional[date]
    accident_time: Optional[str]
    employment_related: Optional[bool]
    other_coverage: Optional[bool]
    other_insurance_company: Optional[str]
    other_policy_number: Optional[str]
    other_policy_holder: Optional[str]


class Authorization(BaseModel):
    patient_signature: Optional[str]
    patient_signature_date: Optional[date]
    assignment_of_benefits_signature: Optional[str]
    assignment_date: Optional[date]


class FacilityInfo(BaseModel):
    name: Optional[str]
    address: Optional[Address]
    admission_date: Optional[date]
    discharge_date: Optional[date]


class Diagnosis(BaseModel):
    primary: Optional[str]
    secondary: Optional[List[str]]
    icd_codes: Optional[List[str]]


class ProcedureEntry(BaseModel):
    service_date: Optional[date]
    place_of_service: Optional[str]
    procedure_code: Optional[str]
    description: Optional[str]
    type_of_service: Optional[str]
    charge: Optional[Decimal]
    units: Optional[int]
    diagnosis_code: Optional[str]


class PhysicianInfo(BaseModel):
    full_name: Optional[str]
    address: Optional[Address]
    phone: Optional[str]
    taxpayer_id: Optional[str]
    patient_account_number: Optional[str]
    national_provider_identifier: Optional[str]
    signature: Optional[str]
    signature_date: Optional[date]


class BillingSummary(BaseModel):
    total_charge: Optional[Decimal]
    amount_paid: Optional[Decimal]
    balance_due: Optional[Decimal]


class MedicalBenefitsClaim(BaseModel):
    employee: Optional[EmployeeInfo]
    patient: Optional[PatientInfo]
    claim_circumstances: Optional[ClaimCircumstances]
    authorization: Optional[Authorization]
    physician: Optional[PhysicianInfo]
    facility: Optional[FacilityInfo]
    diagnosis: Optional[Diagnosis]
    procedures: Optional[List[ProcedureEntry]]
    billing: Optional[BillingSummary]
```
```typescript TypeScript (Zod) theme={"system"}
import * as z from "zod";
import Chunkr from "chunkr-ai";

// Dates are modeled as ISO date strings (z.iso.date()): z.date() is not
// representable in JSON Schema, so z.toJSONSchema() rejects it by default.
const AddressSchema = z.object({
  line1: z.string().optional(),
  line2: z.string().optional(),
  city: z.string().optional(),
  state: z.string().optional(),
  country: z.string().optional(),
  postal_code: z.string().optional(),
  raw: z.string().optional(),
});

const EmployeeInfoSchema = z.object({
  employer_name: z.string().optional(),
  policy_group_number: z.string().optional(),
  aetna_id: z.string().optional(),
  full_name: z.string().optional(),
  birthdate: z.iso.date().optional(),
  employment_status: z.string().optional(),
  date_of_retirement: z.iso.date().optional(),
  address: AddressSchema.optional(),
  phone: z.string().optional(),
});

const PatientInfoSchema = z.object({
  full_name: z.string().optional(),
  aetna_id: z.string().optional(),
  birthdate: z.iso.date().optional(),
  relationship_to_employee: z.string().optional(),
  address: AddressSchema.optional(),
  gender: z.string().optional(),
  marital_status: z.string().optional(),
  employed: z.boolean().optional(),
  employer_name: z.string().optional(),
  employer_address: AddressSchema.optional(),
});

const ClaimCircumstancesSchema = z.object({
  accident_related: z.boolean().optional(),
  accident_date: z.iso.date().optional(),
  accident_time: z.string().optional(),
  employment_related: z.boolean().optional(),
  other_coverage: z.boolean().optional(),
  other_insurance_company: z.string().optional(),
  other_policy_number: z.string().optional(),
  other_policy_holder: z.string().optional(),
});

const AuthorizationSchema = z.object({
  patient_signature: z.string().optional(),
  patient_signature_date: z.iso.date().optional(),
  assignment_of_benefits_signature: z.string().optional(),
  assignment_date: z.iso.date().optional(),
});

const FacilityInfoSchema = z.object({
  name: z.string().optional(),
  address: AddressSchema.optional(),
  admission_date: z.iso.date().optional(),
  discharge_date: z.iso.date().optional(),
});

const DiagnosisSchema = z.object({
  primary: z.string().optional(),
  secondary: z.array(z.string()).optional(),
  icd_codes: z.array(z.string()).optional(),
});

const ProcedureEntrySchema = z.object({
  service_date: z.iso.date().optional(),
  place_of_service: z.string().optional(),
  procedure_code: z.string().optional(),
  description: z.string().optional(),
  type_of_service: z.string().optional(),
  charge: z.number().optional(),
  units: z.number().int().optional(),
  diagnosis_code: z.string().optional(),
});

const PhysicianInfoSchema = z.object({
  full_name: z.string().optional(),
  address: AddressSchema.optional(),
  phone: z.string().optional(),
  taxpayer_id: z.string().optional(),
  patient_account_number: z.string().optional(),
  national_provider_identifier: z.string().optional(),
  signature: z.string().optional(),
  signature_date: z.iso.date().optional(),
});

const BillingSummarySchema = z.object({
  total_charge: z.number().optional(),
  amount_paid: z.number().optional(),
  balance_due: z.number().optional(),
});

const MedicalBenefitsClaimSchema = z.object({
  employee: EmployeeInfoSchema.optional(),
  patient: PatientInfoSchema.optional(),
  claim_circumstances: ClaimCircumstancesSchema.optional(),
  authorization: AuthorizationSchema.optional(),
  physician: PhysicianInfoSchema.optional(),
  facility: FacilityInfoSchema.optional(),
  diagnosis: DiagnosisSchema.optional(),
  procedures: z.array(ProcedureEntrySchema).optional(),
  billing: BillingSummarySchema.optional(),
});
```
### Process
```python Python theme={"system"}
# Convert Pydantic model to JSON schema
schema = MedicalBenefitsClaim.model_json_schema()
client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])
# Create extract task
task = client.tasks.extract.create(
    file="https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/medical_benefits_claim.pdf",
    schema=schema,
)
```
```typescript TypeScript theme={"system"}
// Convert Zod schema to JSON schema
const schema = z.toJSONSchema(MedicalBenefitsClaimSchema);
const client = new Chunkr({ apiKey: process.env.CHUNKR_API_KEY });
// Create extract task
const task = await client.tasks.extract.create({
  file: "https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/medical_benefits_claim.pdf",
  schema: schema,
});
```
### Output
```json Result expandable theme={"system"}
{
"authorization": {
"assignment_date": "2025-02-16",
"assignment_of_benefits_signature": "Anderson MJ",
"patient_signature": "Anderson MJ",
"patient_signature_date": "2025-02-16"
},
"billing": {
"amount_paid": 200,
"balance_due": 1600,
"total_charge": 1800
},
"claim_circumstances": {
"accident_date": "2025-02-12",
"accident_related": true,
"accident_time": "15:30",
"employment_related": false,
"other_coverage": false,
"other_insurance_company": null,
"other_policy_holder": null,
"other_policy_number": null
},
"diagnosis": {
"icd_codes": [
"S52.5",
"S50.1"
],
"primary": "Fractured Radius",
"secondary": [
"Contusion of Forearm"
]
},
"employee": {
"address": {
"city": "Springfield",
"country": null,
"line1": "145 E Indian Blvd",
"line2": null,
"postal_code": null,
"raw": "145 E Indian Blvd, Springfield, IL",
"state": "IL"
},
"aetna_id": "E451958",
"birthdate": "1980-03-14",
"date_of_retirement": null,
"employer_name": "Technova Solutions Inc.",
"employment_status": "Active",
"full_name": "Anderson, Mary J.",
"phone": "2155557890",
"policy_group_number": "A124175"
},
"facility": {
"address": {
"city": "Springfield",
"country": null,
"line1": "451 Health Ave",
"line2": null,
"postal_code": "62704",
"raw": "Springfield Community Hospital, 451 Health Ave, Springfield, IL 62704",
"state": "IL"
},
"admission_date": "2025-02-15",
"discharge_date": "2025-02-16",
"name": "Springfield Community Hospital"
},
"patient": {
"address": {
"city": "Springfield",
"country": null,
"line1": "145 E Indian Blvd",
"line2": null,
"postal_code": null,
"raw": "145 E Indian Blvd, Springfield, IL",
"state": "IL"
},
"aetna_id": "P51309",
"birthdate": "2010-06-24",
"employed": false,
"employer_address": null,
"employer_name": null,
"full_name": "John R.",
"gender": "Male",
"marital_status": "Single",
"relationship_to_employee": "Child"
},
"physician": {
"address": {
"city": "Springfield",
"country": null,
"line1": "3 Health Ave",
"line2": null,
"postal_code": "62705",
"raw": "Springfield Orthopedics, 3 Health Ave, Springfield, IL 62705",
"state": "IL"
},
"full_name": "Dr. Jake Blakey",
"national_provider_identifier": "502357115",
"patient_account_number": "PT-20250215-001",
"phone": "2171527785",
"signature": "Jake B",
"signature_date": "2025-02-16",
"taxpayer_id": "IL-12458"
},
"procedures": [
{
"charge": 250,
"description": "X-ray Forearm",
"diagnosis_code": "S52.5",
"place_of_service": "Outpatient",
"procedure_code": "73090",
"service_date": "2025-02-15",
"type_of_service": "4",
"units": 1
},
{
"charge": 1200,
"description": "Closed Treatment, Radius",
"diagnosis_code": "S52.5",
"place_of_service": "Outpatient",
"procedure_code": "24500",
"service_date": "2025-02-15",
"type_of_service": "2",
"units": 1
},
{
"charge": 350,
"description": "Cast Application",
"diagnosis_code": "S52.5",
"place_of_service": "Outpatient",
"procedure_code": "29125",
"service_date": "2025-02-15",
"type_of_service": "1",
"units": 1
}
]
}
```
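Extracted billing figures can be cross-checked before they are used downstream. The sketch below is a hypothetical post-processing step, not part of the Chunkr SDK: it copies the relevant values from the Result above into a plain dict, converts them to `Decimal` (matching the Pydantic schema), and verifies that the procedure charges add up to `total_charge` and that `total_charge - amount_paid` equals `balance_due`. It assumes each `charge` is a line total rather than a per-unit price; the sample cannot distinguish the two because every `units` value is 1.

```python
from decimal import Decimal

# Subset of the Result above, copied by hand for illustration.
result = {
    "billing": {"amount_paid": 200, "balance_due": 1600, "total_charge": 1800},
    "procedures": [
        {"procedure_code": "73090", "charge": 250, "units": 1},
        {"procedure_code": "24500", "charge": 1200, "units": 1},
        {"procedure_code": "29125", "charge": 350, "units": 1},
    ],
}


def to_decimal(value) -> Decimal:
    """Convert a JSON number to Decimal via str to avoid float artifacts."""
    return Decimal(str(value))


billing = result["billing"]
total_charge = to_decimal(billing["total_charge"])

# Sum of the individual procedure charges (assumed to be line totals).
charges_total = sum(
    (to_decimal(p["charge"]) for p in result["procedures"]), Decimal("0")
)

charges_match = charges_total == total_charge
balance_matches = (
    total_charge - to_decimal(billing["amount_paid"])
    == to_decimal(billing["balance_due"])
)

print(charges_total, charges_match, balance_matches)  # 1800 True True
```

A mismatch in either check does not necessarily mean the extraction is wrong, since the form itself may contain an arithmetic error, but it is a cheap signal that the document deserves a manual look.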
```json Citations expandable theme={"system"}
{
"authorization": {
"assignment_date": [
{
"bboxes": [
{
"height": 1757.04248046875,
"left": 165.21835327148438,
"top": 332.1818542480469,
"width": 3398.235107421875
}
],
"citation_id": "RAkPJD2",
"citation_type": "Segment",
"content": "| TO BE COMPLETED BY EMPLOYEE | | | |\n| :--- | :--- | :--- | :--- |\n| 1. Employer's Name
**Technova Solutions Inc.** | 2. Policy/Group Number
**A124175** | | |\n| 3. Employee's Aetna ID Number
**E451958** | 4. Employee's Name
**Anderson, Mary J.** | 5. Employee's Birthdate (MM/DD/YYYY)
**03/14/1980** | |\n| 6. ☑ Active ☐ Retired
Date of Retirement | 7. Employee's Address (include ZIP Code)
**145 E Indian Blvd, Springfield, IL** | ☑ Address is new | 8. Employee's Daytime Telephone Number
**(** **215** **) 555-7890** |\n| 9. Patient's Name
**John R.** | 10. Patient's Aetna ID Number
**P51309** | 11. Patient's Birthdate (MM/DD/YYYY)
**06/24/2010** | 12. Patient's Relationship to Employee
☐ Self ☐ Spouse ☑ Child ☐ Other |\n| 13. Patient's Address (if different from employee) | | | 14. Patient's Gender
☑ Male ☐ Female |\n| 15. Patient's Marital Status
☐ Married ☑ Single | 16. Is patient employed?
☑ No ☐ Yes | 17. Name & Address of Employer | |\n| 18. Is claim related to an accident?
☐ No ☑ Yes If Yes, date **02/12/2025** time **3:30** ☐ am ☑ pm | 19. Is claim related to employment?
☑ No ☐ Yes | | |\n| 20. Are any family members expenses covered by another group health plan, group pre-payment plan (Blue Cross- Blue Shield, etc.), no fault auto insurance, Medicare or any federal, state or local government plan?
☑ No ☐ Yes | 21. If Yes, list policy or contract holder, policy or contract number(s) and name/address of insurance company or administrator: | | |\n| 22. Member's ID Number | 23. Member's Name | 24. Member's Birthdate (MM/DD/YYYY) | |\n| 25. To all providers of health care:
You are authorized to provide Aetna Life Insurance Company or one of its affiliated companies (\"Aetna\"), and any independent claim administrators and consulting health professionals and utilization review organizations with whom Aetna has contracted, information concerning health care advice, treatment or supplies provided the patient (including that relating to mental illness and/or AIDS/ARC/HIV). This information will be used to evaluate claims for benefits. Aetna may provide the employer named above with any benefit calculation used in payment of this claim for the purpose of reviewing the experience and operation of the policy or contract. This authorization is valid for the term of the policy or contract under which a claim has been submitted. I know that I have a right to receive a copy of this authorization upon request and agree that a photographic copy of this authorization is as valid as the original.
Patient's or Authorized Person's Signature _Anderson MJ_
Date **02/16/2025** | | | |\n| 26. I authorize payment of medical benefits to the physician or supplier of service.
Patient's or Authorized Person's Signature _Anderson MJ_
Date **02/16/2025** | | | |\n",
"page_height": 5262,
"page_number": 1,
"page_width": 3720,
"segment_id": "JKcpWlp",
"segment_type": "FormRegion"
},
// .. more citations
],
// .. more items
},
"billing": {
"amount_paid": [
{
"bboxes": [
{
"height": 994.55126953125,
"left": 164.0844268798828,
"top": 3055.027587890625,
"width": 3395.474365234375
}
],
"citation_id": "HHx7h7v",
"citation_type": "Segment",
"content": "| Date of Service | Place of Service* | Procedure Code Identify** | Description of Service | Type of Service † | Charges | Days or Units | Diagnosis Code †† |\n| :-------------- | :---------------- | :------------------------ | :--------------------- | :---------------- | :------ | :------------ | :---------------- |\n| 02/15/202 | Outpatient | 73090 | X-ray Forearm | 4 | $250 | 1 | S52.5 |\n| 02/15/202 | Outpatient | 24500 | Closed Treatment, Radius | 2 | $1200 | 1 | S52.5 |\n| 02/15/202 | Outpatient | 29125 | Cast Application | 1 | $350 | 1 | S52.5 |\n\n| 39. Physician's Name & Address (include ZIP Code) | 40. Telephone Number | 41. Enter the taxpayer identifying number to be used for 1099 reporting purposes. You are required under authority of law to furnish your taxpayer identifying number. |\n| :------------------------------------------------ | :------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| Dr. Jake Blakey, Springfield Orthopedics, 3 Health Ave, Springfield, IL 62705 | ( 217 ) 1527785 | IL-12458 |\n\n| 42. Patient Account Number | 43. Total charge Amount paid Balance due | $ 1800 $ 200 $ 1600 |\n| :------------------------- | :--------------------------------------- | :------------------ |\n| PT-20250215-001 | | |\n\n| 44. Physician's or Supplier's Signature | 45. National Provider Identifier | 46. Date |\n| :-------------------------------------- | :------------------------------- | :------- |\n| [Signature image] | 502357115 | 02/16/2025 |\n",
"page_height": 5262,
"page_number": 1,
"page_width": 3720,
"segment_id": "LSu9eG3",
"segment_type": "FormRegion"
},
// .. more citations
],
// .. more items
},
"claim_circumstances": {
"accident_date": [
{
"bboxes": [
{
"height": 1757.04248046875,
"left": 165.21835327148438,
"top": 332.1818542480469,
"width": 3398.235107421875
}
],
"citation_id": "lXDxgTh",
"citation_type": "Segment",
"content": "| TO BE COMPLETED BY EMPLOYEE | | | |\n| :--- | :--- | :--- | :--- |\n| 1. Employer's Name
**Technova Solutions Inc.** | 2. Policy/Group Number
**A124175** | | |\n| 3. Employee's Aetna ID Number
**E451958** | 4. Employee's Name
**Anderson, Mary J.** | 5. Employee's Birthdate (MM/DD/YYYY)
**03/14/1980** | |\n| 6. ☑ Active ☐ Retired
Date of Retirement | 7. Employee's Address (include ZIP Code)
**145 E Indian Blvd, Springfield, IL** | ☑ Address is new | 8. Employee's Daytime Telephone Number
**(** **215** **) 555-7890** |\n| 9. Patient's Name
**John R.** | 10. Patient's Aetna ID Number
**P51309** | 11. Patient's Birthdate (MM/DD/YYYY)
**06/24/2010** | 12. Patient's Relationship to Employee
☐ Self ☐ Spouse ☑ Child ☐ Other |\n| 13. Patient's Address (if different from employee) | | | 14. Patient's Gender
☑ Male ☐ Female |\n| 15. Patient's Marital Status
☐ Married ☑ Single | 16. Is patient employed?
☑ No ☐ Yes | 17. Name & Address of Employer | |\n| 18. Is claim related to an accident?
☐ No ☑ Yes If Yes, date **02/12/2025** time **3:30** ☐ am ☑ pm | 19. Is claim related to employment?
☑ No ☐ Yes | | |\n| 20. Are any family members expenses covered by another group health plan, group pre-payment plan (Blue Cross- Blue Shield, etc.), no fault auto insurance, Medicare or any federal, state or local government plan?
☑ No ☐ Yes | 21. If Yes, list policy or contract holder, policy or contract number(s) and name/address of insurance company or administrator: | | |\n| 22. Member's ID Number | 23. Member's Name | 24. Member's Birthdate (MM/DD/YYYY) | |\n| 25. To all providers of health care:
You are authorized to provide Aetna Life Insurance Company or one of its affiliated companies (\"Aetna\"), and any independent claim administrators and consulting health professionals and utilization review organizations with whom Aetna has contracted, information concerning health care advice, treatment or supplies provided the patient (including that relating to mental illness and/or AIDS/ARC/HIV). This information will be used to evaluate claims for benefits. Aetna may provide the employer named above with any benefit calculation used in payment of this claim for the purpose of reviewing the experience and operation of the policy or contract. This authorization is valid for the term of the policy or contract under which a claim has been submitted. I know that I have a right to receive a copy of this authorization upon request and agree that a photographic copy of this authorization is as valid as the original.
Patient's or Authorized Person's Signature _Anderson MJ_
Date **02/16/2025** | | | |\n| 26. I authorize payment of medical benefits to the physician or supplier of service.
Patient's or Authorized Person's Signature _Anderson MJ_
Date **02/16/2025** | | | |\n",
"page_height": 5262,
"page_number": 1,
"page_width": 3720,
"segment_id": "JKcpWlp",
"segment_type": "FormRegion"
},
{
"bboxes": [
{
"height": 42.3504638671875,
"left": 838.3680419921875,
"top": 1198.1663818359375,
"width": 245.69281005859375
}
],
"citation_id": "rP_peLy",
"citation_type": "Word",
"content": "02/12/2025",
"page_height": 5262,
"page_number": 1,
"page_width": 3720,
"segment_type": "Text"
}
],
// .. more items
"other_insurance_company": null,
"other_policy_holder": null,
"other_policy_number": null
},
"diagnosis": {
"icd_codes": [
[
{
"bboxes": [
{
"height": 356.846435546875,
"left": 178.343994140625,
"top": 2652.0048828125,
"width": 1702.4400634765625
}
],
"citation_id": "Yg3yi0-",
"citation_type": "Segment",
"content": "Springfield 62704 Community IL Springfield, Hospital, Ave, 451 Health 37. Diagnosis or nature of illness or injury (please secondary) indicate primary and Radius (ICD-10 S52.5) Fractured 1. 2. Contusion S50.1) of Forearm (ICD-10 3. 4.",
"page_height": 5262,
"page_number": 1,
"page_width": 3720,
"segment_id": "B3r1dMj",
"segment_type": "Text"
},
{
"bboxes": [
{
"height": 44.107177734375,
"left": 734.3712158203125,
"top": 2775.801513671875,
"width": 123.407958984375
}
],
"citation_id": "bdwQSHk",
"citation_type": "Word",
"content": "S52.5)",
"page_height": 5262,
"page_number": 1,
"page_width": 3720,
"segment_type": "Text"
},
{
"bboxes": [
{
"height": 994.55126953125,
"left": 164.0844268798828,
"top": 3055.027587890625,
"width": 3395.474365234375
}
],
"citation_id": "qOOUXsQ",
"citation_type": "Segment",
"content": "| Date of Service | Place of Service* | Procedure Code Identify** | Description of Service | Type of Service † | Charges | Days or Units | Diagnosis Code †† |\n| :-------------- | :---------------- | :------------------------ | :--------------------- | :---------------- | :------ | :------------ | :---------------- |\n| 02/15/202 | Outpatient | 73090 | X-ray Forearm | 4 | $250 | 1 | S52.5 |\n| 02/15/202 | Outpatient | 24500 | Closed Treatment, Radius | 2 | $1200 | 1 | S52.5 |\n| 02/15/202 | Outpatient | 29125 | Cast Application | 1 | $350 | 1 | S52.5 |\n\n| 39. Physician's Name & Address (include ZIP Code) | 40. Telephone Number | 41. Enter the taxpayer identifying number to be used for 1099 reporting purposes. You are required under authority of law to furnish your taxpayer identifying number. |\n| :------------------------------------------------ | :------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| Dr. Jake Blakey, Springfield Orthopedics, 3 Health Ave, Springfield, IL 62705 | ( 217 ) 1527785 | IL-12458 |\n\n| 42. Patient Account Number | 43. Total charge Amount paid Balance due | $ 1800 $ 200 $ 1600 |\n| :------------------------- | :--------------------------------------- | :------------------ |\n| PT-20250215-001 | | |\n\n| 44. Physician's or Supplier's Signature | 45. National Provider Identifier | 46. Date |\n| :-------------------------------------- | :------------------------------- | :------- |\n| [Signature image] | 502357115 | 02/16/2025 |\n",
"page_height": 5262,
"page_number": 1,
"page_width": 3720,
"segment_id": "LSu9eG3",
"segment_type": "FormRegion"
},
{
"bboxes": [
{
"height": 44.467041015625,
"left": 3064.593505859375,
"top": 3214.814453125,
"width": 141.796875
}
],
"citation_id": "qDvQzGU",
"citation_type": "Word",
"content": "S52.5",
"page_height": 5262,
"page_number": 1,
"page_width": 3720,
"segment_type": "Text"
}
], // .. more icd codes
// .. more citations
],
// .. more items
},
"employee": {
"address": {
"city": [
{
"bboxes": [
{
"height": 1757.04248046875,
"left": 165.21835327148438,
"top": 332.1818542480469,
"width": 3398.235107421875
}
],
"citation_id": "GLYPlMg",
"citation_type": "Segment",
"content": "| TO BE COMPLETED BY EMPLOYEE | | | |\n| :--- | :--- | :--- | :--- |\n| 1. Employer's Name
**Technova Solutions Inc.** | 2. Policy/Group Number
**A124175** | | |\n| 3. Employee's Aetna ID Number
**E451958** | 4. Employee's Name
**Anderson, Mary J.** | 5. Employee's Birthdate (MM/DD/YYYY)
**03/14/1980** | |\n| 6. ☑ Active ☐ Retired
Date of Retirement | 7. Employee's Address (include ZIP Code)
**145 E Indian Blvd, Springfield, IL** | ☑ Address is new | 8. Employee's Daytime Telephone Number
**(** **215** **) 555-7890** |\n| 9. Patient's Name
**John R.** | 10. Patient's Aetna ID Number
**P51309** | 11. Patient's Birthdate (MM/DD/YYYY)
**06/24/2010** | 12. Patient's Relationship to Employee
☐ Self ☐ Spouse ☑ Child ☐ Other |\n| 13. Patient's Address (if different from employee) | | | 14. Patient's Gender
☑ Male ☐ Female |\n| 15. Patient's Marital Status
☐ Married ☑ Single | 16. Is patient employed?
☑ No ☐ Yes | 17. Name & Address of Employer | |\n| 18. Is claim related to an accident?
☐ No ☑ Yes If Yes, date **02/12/2025** time **3:30** ☐ am ☑ pm | 19. Is claim related to employment?
☑ No ☐ Yes | | |\n| 20. Are any family members expenses covered by another group health plan, group pre-payment plan (Blue Cross- Blue Shield, etc.), no fault auto insurance, Medicare or any federal, state or local government plan?
☑ No ☐ Yes | 21. If Yes, list policy or contract holder, policy or contract number(s) and name/address of insurance company or administrator: | | |\n| 22. Member's ID Number | 23. Member's Name | 24. Member's Birthdate (MM/DD/YYYY) | |\n| 25. To all providers of health care:
You are authorized to provide Aetna Life Insurance Company or one of its affiliated companies (\"Aetna\"), and any independent claim administrators and consulting health professionals and utilization review organizations with whom Aetna has contracted, information concerning health care advice, treatment or supplies provided the patient (including that relating to mental illness and/or AIDS/ARC/HIV). This information will be used to evaluate claims for benefits. Aetna may provide the employer named above with any benefit calculation used in payment of this claim for the purpose of reviewing the experience and operation of the policy or contract. This authorization is valid for the term of the policy or contract under which a claim has been submitted. I know that I have a right to receive a copy of this authorization upon request and agree that a photographic copy of this authorization is as valid as the original.
Patient's or Authorized Person's Signature _Anderson MJ_
Date **02/16/2025** | | | |\n| 26. I authorize payment of medical benefits to the physician or supplier of service.
Patient's or Authorized Person's Signature _Anderson MJ_
Date **02/16/2025** | | | |\n",
"page_height": 5262,
"page_number": 1,
"page_width": 3720,
"segment_id": "JKcpWlp",
"segment_type": "FormRegion"
},
// .. more citations
],
// .. more items
},
// .. more items
},
"facility": {
"address": {
"city": [
{
"bboxes": [
{
"height": 356.846435546875,
"left": 178.343994140625,
"top": 2652.0048828125,
"width": 1702.4400634765625
}
],
"citation_id": "NnSB0Qs",
"citation_type": "Segment",
"content": "Springfield 62704 Community IL Springfield, Hospital, Ave, 451 Health 37. Diagnosis or nature of illness or injury (please secondary) indicate primary and Radius (ICD-10 S52.5) Fractured 1. 2. Contusion S50.1) of Forearm (ICD-10 3. 4.",
"page_height": 5262,
"page_number": 1,
"page_width": 3720,
"segment_id": "B3r1dMj",
"segment_type": "Text"
},
// .. more citations
],
// .. more items
},
// .. more items
},
// .. more items
}
```
```json Metrics expandable theme={"system"}
{
"authorization": {
"assignment_date": {
"citation_status": "Created",
"confidence": "High"
},
"assignment_of_benefits_signature": {
"citation_status": "Created",
"confidence": "High"
},
"patient_signature": {
"citation_status": "Created",
"confidence": "High"
},
"patient_signature_date": {
"citation_status": "Created",
"confidence": "High"
}
},
"billing": {
"amount_paid": {
"citation_status": "Created",
"confidence": "High"
},
"balance_due": {
"citation_status": "Created",
"confidence": "High"
},
"total_charge": {
"citation_status": "Created",
"confidence": "High"
}
},
"claim_circumstances": {
"accident_date": {
"citation_status": "Created",
"confidence": "High"
},
"accident_related": {
"citation_status": "Created",
"confidence": "High"
},
"accident_time": {
"citation_status": "Created",
"confidence": "High"
},
"employment_related": {
"citation_status": "Created",
"confidence": "High"
},
"other_coverage": {
"citation_status": "Created",
"confidence": "High"
},
"other_insurance_company": {
"citation_status": "Skipped",
"confidence": "High"
},
"other_policy_holder": {
"citation_status": "Skipped",
"confidence": "High"
},
"other_policy_number": {
"citation_status": "Skipped",
"confidence": "High"
}
},
"diagnosis": {
"icd_codes": {
"citation_status": "Created",
"confidence": "High"
},
"primary": {
"citation_status": "Created",
"confidence": "High"
},
"secondary": {
"citation_status": "Created",
"confidence": "High"
}
},
"employee": {
"address": {
"city": {
"citation_status": "Created",
"confidence": "High"
},
// .. more items
},
"aetna_id": {
"citation_status": "Created",
"confidence": "High"
},
"birthdate": {
"citation_status": "Created",
"confidence": "High"
},
"date_of_retirement": {
"citation_status": "Skipped",
"confidence": "High"
},
"employer_name": {
"citation_status": "Created",
"confidence": "High"
},
"employment_status": {
"citation_status": "Created",
"confidence": "High"
},
"full_name": {
"citation_status": "Created",
"confidence": "High"
},
"phone": {
"citation_status": "Created",
"confidence": "High"
},
"policy_group_number": {
"citation_status": "Created",
"confidence": "High"
}
},
"facility": {
"address": {
"city": {
"citation_status": "Created",
"confidence": "High"
},
// .. more items
},
"admission_date": {
"citation_status": "Created",
"confidence": "High"
},
"discharge_date": {
"citation_status": "Created",
"confidence": "High"
},
"name": {
"citation_status": "Created",
"confidence": "High"
}
},
"patient": {
"address": {
"city": {
"citation_status": "Created",
"confidence": "High"
},
// .. more items
},
"aetna_id": {
"citation_status": "Created",
"confidence": "High"
},
// .. more items
},
"physician": {
"address": {
"city": {
"citation_status": "Created",
"confidence": "High"
},
// .. more items
},
"full_name": {
"citation_status": "Created",
"confidence": "High"
},
"national_provider_identifier": {
"citation_status": "Created",
"confidence": "High"
},
"patient_account_number": {
"citation_status": "Created",
"confidence": "High"
},
"phone": {
"citation_status": "Created",
"confidence": "High"
},
"signature": {
"citation_status": "Created",
"confidence": "Low"
},
"signature_date": {
"citation_status": "Created",
"confidence": "High"
},
"taxpayer_id": {
"citation_status": "Created",
"confidence": "High"
}
},
"procedures": [
{
"charge": {
"citation_status": "Created",
"confidence": "High"
},
"description": {
"citation_status": "Created",
"confidence": "High"
},
"diagnosis_code": {
"citation_status": "Created",
"confidence": "High"
},
"place_of_service": {
"citation_status": "Created",
"confidence": "High"
},
"procedure_code": {
"citation_status": "Created",
"confidence": "High"
},
"service_date": {
"citation_status": "Created",
"confidence": "Low"
},
"type_of_service": {
"citation_status": "Created",
"confidence": "High"
},
"units": {
"citation_status": "Created",
"confidence": "High"
}
}
// .. more procedures
]
}
```
## Best Practices
### Schema Design Tips
1. **Use Descriptive Field Names**: Choose clear, unambiguous field names that reflect the actual data being extracted.
2. **Handle Optional Fields Appropriately**: Mark fields as optional when they may not be present in all documents.
3. **Include Field Descriptions**: Use Pydantic's `Field(description="...")` or Zod's `.describe()` to provide context.
4. **Use Appropriate Data Types**: Choose the right types (string, number, boolean, array, date) for each field.
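As a rough illustration of the tips above, the sketch below writes a small JSON Schema by hand as a Python dict. The field names and descriptions are invented for this example, and whether a given JSON Schema feature (such as the nullable type used here) is accepted should be confirmed against the API reference. A Pydantic or Zod model with field descriptions would generate a broadly similar structure.

```python
import json

# Hypothetical invoice schema, written by hand for illustration only.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "Invoice identifier as printed on the document.",
        },
        "total_amount": {
            "type": "number",
            "description": "Grand total including tax, without a currency symbol.",
        },
        "payment_terms": {
            # Optional field: it accepts null and is left out of "required".
            "type": ["string", "null"],
            "description": "Payment terms such as 'Net 30', if stated.",
        },
        "tags": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Status labels printed or stamped on the invoice.",
        },
    },
    "required": ["invoice_number", "total_amount"],
}

print(json.dumps(invoice_schema, indent=2))
```

Each property carries a description and a specific type, and only the fields expected on every document are listed under `required`.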
### Extraction Optimization
1. **Custom System Prompts**: Tailor prompts to your document type. See the [system prompt parameter](/api-references/tasks/create-extract-task#body-system-prompt) for details.
2. **Quality Validation**: Use citation trails alongside confidence scores to validate extractions.
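One possible way to combine citations with confidence is sketched below: a field counts as well supported only when its metrics report `High` confidence with a `Created` citation and at least one citation's `content` contains the extracted value. The helper name `is_well_supported` and the substring check are assumptions made for this example, not part of the Chunkr SDK.

```python
def is_well_supported(value, citations, metric):
    """Return True when a field has High confidence and a citation containing its value."""
    if metric.get("confidence") != "High" or metric.get("citation_status") != "Created":
        return False
    text = str(value).lower()
    # `citations` may be None when no citations were created for the field.
    return any(text in (c.get("content") or "").lower() for c in citations or [])


# Sample data shaped like the citation and metrics objects in the Extract output.
citations = [
    {"citation_type": "Segment", "content": "Invoice # INV-2024-001", "page_number": 1},
]
metric = {"confidence": "High", "citation_status": "Created"}

print(is_well_supported("INV-2024-001", citations, metric))  # True
print(is_well_supported("INV-2024-999", citations, metric))  # False
```

A plain substring check will miss values that were normalized during extraction (reformatted dates or numbers, for instance), so a production check would likely need type-aware comparison.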
# Extract Outputs
Source: https://docs.chunkr.ai/pages/features/extract/outputs
Understanding data returned by the Extract feature
## High-Level Structure
```mermaid theme={"system"}
graph TD;
Task["Task<br/>(task_id, status, ...)"] --> Output["Output Object"];
Output --> Results["results<br/>(Your JSON Schema, populated)"];
Output --> Citations["citations { }"];
Output --> Metrics["metrics { }"];
Results --> Field1["invoice_number: 'INV-2024-001'<br/>vendor: {...}<br/>line_items: [...]"];
Citations --> CitationMirror["Mirrors results structure"];
CitationMirror --> CitationLeaf["Leaf Field → Array of Citation Entries"];
CitationLeaf --> CitationEntry["Citation Entry<br/>- citation_type<br/>- content<br/>- page_number<br/>- bboxes[]"];
Metrics --> MetricsMap["Field Path → Metrics Object"];
MetricsMap --> ConfidenceValue["Metrics Object<br/>- confidence ('High'/'Low')<br/>- citation_status ('Created'/'Failed'/'Skipped')"];
```
Extract returns a `Task` object. When processing succeeds, the `output` field contains `results`: your custom JSON schema, populated with extracted values. Alongside `results`, the output includes `citations` and `metrics`, which mirror your schema and can be addressed using **field paths**.
* **`results`**: Your exact JSON schema structure, filled with extracted data.
* **`citations`**: Mirrors the `results` shape. At every leaf field (a primitive value) you will find an array of citation objects. For arrays of primitives, `citations` is an array where each element holds the citation array for that index or `null` when no citations were created for that element.
* **`metrics`**: Mirrors the `results` shape. At every leaf field, you will find a metrics object. For arrays of primitives, `metrics` is an array where each element holds the metrics object for that index.
### Primitives and Final Items
In this documentation, a primitive is any JSON value that is one of: `null`, `boolean`, `number`, or `string`.
We consider a field a final item when it holds a primitive value. For arrays of primitives, each element is treated as its own final item (e.g., `tags[0]`, `tags[1]`). Citations and metrics are generated at the level of each final item.
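To make the idea of final items concrete, here is a minimal sketch that walks a `results` object and yields every primitive together with its location, using dot notation for objects and bracket notation for arrays. The function name `iter_final_items` is made up for this example.

```python
def iter_final_items(value, path=""):
    """Yield (field_path, primitive) pairs for every final item in a results object."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from iter_final_items(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from iter_final_items(child, f"{path}[{index}]")
    else:
        # null, boolean, number, or string: a primitive, so this is a final item.
        yield path, value


results = {
    "invoice_number": "INV-2024-001",
    "vendor": {"vendor_name": "Acme Corp"},
    "tags": ["Overdue", "International"],
}

for field_path, primitive in iter_final_items(results):
    print(field_path, "=", primitive)
```

Each element of `tags` is reported separately (`tags[0]`, `tags[1]`), matching the way citations and metrics are generated per final item.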
### How Field Paths Work
Field paths use dot notation for nested objects and bracket notation for arrays:
* **Top-level fields**: `invoice_number`, `total_amount`
* **Nested object fields**: `vendor.vendor_name`, `vendor.contact_email`
* **Array item fields**: `line_items[0].item_description`, `line_items[1].unit_price`
These paths can be used to address values within both the `citations` and `metrics` objects (which mirror `results`), allowing you to programmatically link each extracted value back to its source location and confidence level.
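The sketch below shows one way to resolve such a path against plain dicts and lists. `get_by_path` is a hypothetical helper written for this example, and it assumes field names contain no dots or brackets.

```python
import re

# Matches either a field name or a bracketed array index.
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def get_by_path(tree, field_path):
    """Follow a path such as 'line_items[1].quantity' through nested dicts and lists."""
    node = tree
    for key, index in _TOKEN.findall(field_path):
        node = node[key] if key else node[int(index)]
    return node


# Trimmed-down output object; the values are sample data.
output = {
    "results": {"line_items": [{"quantity": 10}, {"quantity": 4}]},
    "citations": {
        "line_items": [
            {"quantity": [{"citation_type": "Segment", "content": "10"}]},
            {"quantity": [{"citation_type": "Segment", "content": "4"}]},
        ]
    },
    "metrics": {
        "line_items": [
            {"quantity": {"confidence": "High", "citation_status": "Created"}},
            {"quantity": {"confidence": "High", "citation_status": "Created"}},
        ]
    },
}

path = "line_items[1].quantity"
print(get_by_path(output["results"], path))    # the extracted value
print(get_by_path(output["citations"], path))  # its citation entries
print(get_by_path(output["metrics"], path))    # its metrics object
```

Because `citations` and `metrics` mirror `results`, the same path string works against all three objects.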
```json expandable Example Output theme={"system"}
{
"task_id": "extract-8b7e7e8a-...",
"status": "Succeeded",
"output": {
"results": {
"invoice_number": "INV-2024-001",
"invoice_date": "2024-03-15",
"due_date": "2024-04-15",
"vendor": {
"vendor_name": "Acme Corp",
"vendor_id": "ACME-001",
"contact_email": "billing@acme.com",
"phone_number": "+1-555-123-4567",
"address": "1 Acme Way, Metropolis, NY 10001"
},
"line_items": [
{ "item_description": "Widget A", "quantity": 10, "unit_price": 12.5, "line_total": 125.0 },
{ "item_description": "Widget B", "quantity": 4, "unit_price": 50.0, "line_total": 200.0 }
],
"subtotal": 325.0,
"tax_amount": 26.0,
"total_amount": 351.0,
"payment_terms": "Net 30",
"tags": ["Overdue", "International"]
},
"citations": {
"invoice_number": [
{
"citation_type": "Segment",
"content": "Invoice # INV-2024-001",
"segment_type": "Text",
"page_number": 1,
"page_width": 792,
"page_height": 612,
"bboxes": [ { "left": 450, "top": 120, "width": 100, "height": 20 } ]
}
],
"vendor": {
"vendor_name": [
{
"citation_type": "Segment",
"content": "Acme Corp",
"segment_type": "Text",
"page_number": 1,
"page_width": 792,
"page_height": 612,
"bboxes": [ { "left": 100, "top": 200, "width": 200, "height": 18 } ]
}
]
},
"line_items": [
{
"item_description": [
{
"citation_type": "Segment",
"content": "Widget A",
"segment_type": "Text",
"page_number": 1,
"page_width": 792,
"page_height": 612,
"bboxes": [ { "left": 100, "top": 350, "width": 80, "height": 15 } ]
}
]
},
{
"item_description": [
{
"citation_type": "Segment",
"content": "Widget B",
"segment_type": "Text",
"page_number": 1,
"page_width": 792,
"page_height": 612,
"bboxes": [ { "left": 100, "top": 370, "width": 80, "height": 15 } ]
}
],
"quantity": [
{
"citation_type": "Segment",
"content": "4",
"segment_type": "Text",
"page_number": 1,
"page_width": 792,
"page_height": 612,
"bboxes": [ { "left": 400, "top": 370, "width": 30, "height": 15 } ]
}
]
}
],
"tags": [
[
{
"citation_type": "Segment",
"content": "Overdue",
"segment_type": "Text",
"page_number": 1,
"page_width": 792,
"page_height": 612,
"bboxes": [ { "left": 120, "top": 160, "width": 80, "height": 16 } ]
}
],
[
{
"citation_type": "Segment",
"content": "International",
"segment_type": "Text",
"page_number": 1,
"page_width": 792,
"page_height": 612,
"bboxes": [ { "left": 210, "top": 160, "width": 120, "height": 16 } ]
}
]
]
},
"metrics": {
"invoice_number": { "confidence": "High", "citation_status": "Created" },
"vendor": {
"vendor_name": { "confidence": "High", "citation_status": "Created" },
"contact_email": { "confidence": "Low", "citation_status": "Created" }
},
"line_items": [
{
"item_description": { "confidence": "High", "citation_status": "Created" },
"quantity": { "confidence": "High", "citation_status": "Created" }
},
{
"item_description": { "confidence": "High", "citation_status": "Created" },
"quantity": { "confidence": "High", "citation_status": "Created" }
}
],
"tags": [
{ "confidence": "High", "citation_status": "Created" },
{ "confidence": "High", "citation_status": "Created" }
],
"total_amount": { "confidence": "High", "citation_status": "Created" }
}
}
}
```
***
## Key Output Fields
### 1. `results`
The `results` object contains your extracted data, structured exactly according to the JSON schema you provided. Every field, nested object, and array element follows your schema definition, making it simple to integrate into your application logic.
### 2. `citations`
Citations provide full traceability for each extracted value. The `citations` object mirrors your `results` structure. At each leaf (primitive value), you will find an array of citation objects. For arrays of primitives, `citations` is an array where each element holds the citation array for that index (e.g., `tags[0]`, `tags[1]`) or `null` if no citations were created for that element. A single field can have multiple citation types depending on the document source and extraction granularity.
**Citation Granularity**:
* **Segment-level citations** are always provided. These reference semantic elements like paragraphs, tables, or text blocks that support the extracted value.
* **Word-level citations** may also be included for finer-grained traceability. When word-level citations are present, segment-level citations will also be included in the same array.
For spreadsheets, both segment-level and word-level citations also include the cell range and sheet name in the `ss_ranges` and `ss_sheet_name` fields.
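When both granularities are present for a field, a consumer may want the narrowest available citation. A minimal sketch, assuming citation entries are plain dicts with a `citation_type` key as described here (the helper name `most_specific` is invented for this example):

```python
def most_specific(citations):
    """Return the Word citations when any exist, otherwise the Segment citations."""
    if not citations:  # None (no citations created) or an empty list
        return []
    words = [c for c in citations if c.get("citation_type") == "Word"]
    return words or list(citations)


# Sample citation array with one segment-level and one word-level entry.
accident_date = [
    {"citation_type": "Segment", "segment_type": "FormRegion", "page_number": 1},
    {"citation_type": "Word", "content": "02/12/2025", "page_number": 1},
]

print(most_specific(accident_date))
print(most_specific(None))
```

Falling back to the segment-level entries keeps the helper usable for fields where no word-level citation was produced.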
**Citation Object Fields**:
* **`citation_id`**: Unique identifier for this citation.
* **`citation_type`**: The citation granularity: `"Segment"` or `"Word"`.
* **`content`**: The content supporting the extraction. For `Segment` citations, this is the HTML/Markdown `content` from the Parse output (e.g., HTML for a table). For `Word` citations, it's the raw OCR text.
* **`segment_type`**: The type of segment (e.g., `"Text"`, `"Table"`, `"Title"`). For word-level citations, this is always `"Text"`.
* **`segment_id`**: Identifier linking back to the original segment from Parse. Only present for segment-level citations.
* **`page_number`**: The page where the citation appears.
* **`page_width`, `page_height`**: Page dimensions in pixels for normalizing bounding boxes.
* **`bboxes`**: Array of bounding box objects (`{ left, top, width, height }` in pixels) pinpointing the exact location(s) on the page.
* **`ss_ranges`**: Array of cell ranges in A1 notation (e.g., `["A1:C10"]`). Only present for spreadsheet citations.
* **`ss_sheet_name`**: The sheet name where the data was found. Only present for spreadsheet citations.
```json Segment Citation theme={"system"}
{
"citation_id": "abc1234",
"citation_type": "Segment",
"content": "Invoice # INV-2024-001",
"segment_id": "seg_001", // Only present for segment citations
"segment_type": "Text",
"page_number": 1,
"page_height": 612,
"page_width": 792,
"bboxes": [
{ "left": 450, "top": 120, "width": 100, "height": 20 }
],
"ss_ranges": ["D15:E20"], // Only present if file is a spreadsheet
"ss_sheet_name": "Invoice" // Only present if file is a spreadsheet
}
```
```json Word Citation theme={"system"}
{
"citation_id": "word5678",
"citation_type": "Word",
"content": "INV-2024-001",
"segment_type": "Text", // Always `Text` for word citations
"page_number": 1,
"page_height": 612,
"page_width": 792,
"bboxes": [
{ "left": 465, "top": 122, "width": 85, "height": 16 }
],
"ss_ranges": ["D15"], // Only present if file is a spreadsheet
"ss_sheet_name": "Invoice" // Only present if file is a spreadsheet
}
```
Citations enable powerful use cases:
* **Document Viewers**: Highlight the exact source text when a user clicks on an extracted field.
* **Validation Workflows**: Let human reviewers verify extracted values against their original context.
* **Audit Trails**: Track which parts of a document contributed to each data point.
* **Spreadsheet Navigation**: Jump directly to source cells in spreadsheet viewers using `ss_ranges`.
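For the document viewer case, pixel bounding boxes usually need to be scaled to the size at which the page is rendered. The sketch below converts a citation's `bboxes` into fractions of the page using `page_width` and `page_height`. The helper name and the sample numbers are assumptions chosen for this example.

```python
def normalize_bboxes(citation):
    """Convert a citation's pixel bboxes into fractions of the page size (0.0 to 1.0)."""
    page_w = citation["page_width"]
    page_h = citation["page_height"]
    return [
        {
            "left": box["left"] / page_w,
            "top": box["top"] / page_h,
            "width": box["width"] / page_w,
            "height": box["height"] / page_h,
        }
        for box in citation["bboxes"]
    ]


# Hypothetical citation whose box covers a quarter of the page width.
citation = {
    "page_width": 792,
    "page_height": 612,
    "bboxes": [{"left": 396, "top": 153, "width": 198, "height": 306}],
}

print(normalize_bboxes(citation))
# [{'left': 0.5, 'top': 0.25, 'width': 0.25, 'height': 0.5}]
```

Multiplying each fraction by the rendered width or height of the page gives the overlay rectangle to highlight.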
### 3. `metrics`
The `metrics` object mirrors your schema structure and provides metrics for each extracted field. It contains:
* **`confidence`**: `High` or `Low`, indicating whether the value is supported by citations.
* **`citation_status`**: `Created`, `Failed`, or `Skipped`, indicating the status of citation generation.
Looking for a complete schema of all output fields? See our [API Reference](/api-references/tasks/get-extract-task).
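A common use of `metrics` is to route uncertain fields to a human reviewer. The following sketch walks a metrics object and collects the field paths whose `confidence` is `Low` or whose `citation_status` is `Failed`. Treating `Skipped` as acceptable is a policy choice made for this example, and the walker assumes none of your own schema fields are named `confidence` or `citation_status`.

```python
def fields_to_review(metrics, path=""):
    """Collect field paths with Low confidence or a Failed citation status."""
    if isinstance(metrics, list):
        return [
            found
            for index, child in enumerate(metrics)
            for found in fields_to_review(child, f"{path}[{index}]")
        ]
    if not isinstance(metrics, dict):
        return []
    if isinstance(metrics.get("confidence"), str) and isinstance(metrics.get("citation_status"), str):
        # Leaf metrics object for a single final item.
        flagged = metrics["confidence"] == "Low" or metrics["citation_status"] == "Failed"
        return [path] if flagged else []
    return [
        found
        for key, child in metrics.items()
        for found in fields_to_review(child, f"{path}.{key}" if path else key)
    ]


# Sample metrics shaped like the Extract output.
metrics = {
    "physician": {
        "full_name": {"citation_status": "Created", "confidence": "High"},
        "signature": {"citation_status": "Created", "confidence": "Low"},
    },
    "claim_circumstances": {
        "other_policy_number": {"citation_status": "Skipped", "confidence": "High"},
    },
    "procedures": [
        {"service_date": {"citation_status": "Created", "confidence": "Low"}},
    ],
}

print(fields_to_review(metrics))
# ['physician.signature', 'procedures[0].service_date']
```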
# Extract Overview
Source: https://docs.chunkr.ai/pages/features/extract/overview
Transform documents into structured data with granular citations
The Extract feature transforms documents into structured data that follows a schema you define.
It takes the output of an existing parse task (or parses the document automatically) and fills your JSON schema with extracted values.
Every extracted field comes with source citations and a confidence metric.
## Key Features
* **Schema-driven extraction**: Define your exact data structure using JSON Schema and get results that follow it.
* **Granular citations**: Every extracted value includes precise source references to the original document location.
* **Confidence scoring**: Built-in confidence metrics for each extracted field to assess reliability.
* **Flexible input options**: Works with existing parse tasks, raw documents, or remote URLs.
* **Intelligent field mapping**: Automatically identifies and maps document content to your schema fields.
***
Extract builds on top of Parse. If you provide a raw document, a parse task will be created automatically, and then the extract task will be created using the parse task ID.
See [API Reference](/api-references/tasks/create-extract-task) for more details on how to configure the parse task that will be automatically created.
## How It Works
1. **Input Processing**: Extract accepts either a raw document (URL, file upload, or base64) or a reference to an existing parse task.
2. **Schema Analysis**: Your JSON schema is analyzed to understand the target data structure and field requirements.
3. **Intelligent Extraction**: The system maps document content to your schema fields using AI.
4. **Citation & Scoring**: Each extracted value is annotated with source citations and confidence.
5. **Structured Output**: Returns your data in the exact schema format with enriched metadata.
### Make a JSON Schema
Use Pydantic or Zod to define your schema, then pass the generated JSON schema to Extract.
```python Python theme={"system"}
import os
from typing import List, Optional

from chunkr_ai import Chunkr
from pydantic import BaseModel


class Vendor(BaseModel):
    vendor_name: str
    vendor_id: Optional[str] = None
    contact_email: Optional[str] = None
    phone_number: Optional[str] = None
    address: Optional[str] = None


class InvoiceLineItem(BaseModel):
    item_description: str
    quantity: float
    unit_price: float
    line_total: float


class Invoice(BaseModel):
    invoice_number: str
    invoice_date: str
    due_date: str
    vendor: Vendor
    line_items: List[InvoiceLineItem]
    subtotal: float
    tax_amount: float
    total_amount: float
    payment_terms: Optional[str] = None


# Convert Pydantic model to JSON schema
schema = Invoice.model_json_schema()

client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])
url = "https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/invoice.pdf"

# Pass the schema to the extract task
task = client.tasks.extract.create(file=url, schema=schema)
```
```typescript TypeScript theme={"system"}
import Chunkr from "chunkr-ai";
import * as z from "zod";
const VendorSchema = z.object({
vendor_name: z.string(),
vendor_id: z.string().optional(),
contact_email: z.string().optional(),
phone_number: z.string().optional(),
address: z.string().optional(),
});
const InvoiceLineItemSchema = z.object({
item_description: z.string(),
quantity: z.number(),
unit_price: z.number(),
line_total: z.number(),
});
const InvoiceSchema = z.object({
invoice_number: z.string(),
invoice_date: z.string(),
due_date: z.string(),
vendor: VendorSchema,
line_items: z.array(InvoiceLineItemSchema),
subtotal: z.number(),
tax_amount: z.number(),
total_amount: z.number(),
payment_terms: z.string().optional(),
});
// Convert Zod schema to JSON schema
const schema = z.toJSONSchema(InvoiceSchema);
const client = new Chunkr({
apiKey: process.env.CHUNKR_API_KEY,
});
const url = "https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/invoice.pdf";
const task = await client.tasks.extract.create({
file: url,
schema: schema,
});
```
### Input Options
Extract accepts a file from a URL, a local file uploaded with `client.files.create`, a base64-encoded string, or an existing parse task ID.
```python Python theme={"system"}
import os

from chunkr_ai import Chunkr

client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])

# `schema` is the JSON schema generated in the previous example

# From URL
task = client.tasks.extract.create(
    file="https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/invoice.pdf",
    schema=schema,
)

# From local file (upload-first)
with open("path/to/doc.pdf", "rb") as f:
    up = client.files.create(file=f)
task2 = client.tasks.extract.create(file=up.url, schema=schema)

# From base64
task3 = client.tasks.extract.create(
    file="data:application/pdf;base64,...", schema=schema
)

# From an existing parse task
parse_task = client.tasks.parse.get(task_id="parse_task_id")
task4 = client.tasks.extract.create(file=parse_task.task_id, schema=schema)
```
```typescript TypeScript theme={"system"}
import Chunkr from "chunkr-ai";
import fs from "fs";
const client = new Chunkr({ apiKey: process.env.CHUNKR_API_KEY! });
// From URL
const task = await client.tasks.extract.create({
file: "https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/invoice.pdf",
schema,
});
// From local file (upload-first)
const fileStream = fs.createReadStream("path/to/doc.pdf");
const up = await client.files.create({
file: fileStream,
file_metadata: JSON.stringify({ name: "doc.pdf", type: "application/pdf" }),
});
const task2 = await client.tasks.extract.create({ file: up.url, schema });
// From base64
const task3 = await client.tasks.extract.create({
file: "data:application/pdf;base64,...",
schema,
});
// From an existing parse task
const parseTask = await client.tasks.parse.get("parse_task_id");
const task4 = await client.tasks.extract.create({
file: parseTask.task_id,
schema,
});
```
When referencing an existing parse task, you cannot provide `parse_configuration` or `file_name` parameters, as these are inherited from the original parse task.
Extract supports all Parse configuration options when processing raw documents, plus extraction-specific settings:
### Extraction Configuration
* **Schema (`schema`)**: Your JSON Schema definition that describes the target data structure. Required field.
* **System Prompt (`system_prompt`)**: Customize the LLM prompt for extraction. Default: "You are an expert at structured data extraction. You will be given parsed text from a document and should convert it into the given structure."
* **Task Expiration (`expires_in`)**: Set automatic cleanup time in seconds for completed tasks.
For an overview of Parse configuration options, see [Parse Configuration](/pages/features/parse/overview#advanced-configuration).
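As a small sketch of how these settings might be assembled, the helper below builds the keyword arguments for an extract request and leaves out optional settings that were not provided. The helper name `build_extract_kwargs` is invented for this example, and passing `system_prompt` and `expires_in` as keyword arguments to `client.tasks.extract.create` is an assumption based on the parameter names listed above.

```python
def build_extract_kwargs(file, schema, system_prompt=None, expires_in=None):
    """Assemble extract settings, omitting optional values that were not provided."""
    kwargs = {"file": file, "schema": schema}
    if system_prompt is not None:
        kwargs["system_prompt"] = system_prompt
    if expires_in is not None:
        if expires_in <= 0:
            raise ValueError("expires_in must be a positive number of seconds")
        kwargs["expires_in"] = expires_in
    return kwargs


kwargs = build_extract_kwargs(
    file="https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/invoice.pdf",
    schema={"type": "object", "properties": {"invoice_number": {"type": "string"}}},
    system_prompt="Extract invoice fields exactly as printed.",
    expires_in=3600,  # one hour, in seconds
)
print(sorted(kwargs))

# Assumed usage with the SDK client shown earlier:
# task = client.tasks.extract.create(**kwargs)
```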
***
## Best Practices
1. **Schema Design**: Create clear, well-structured schemas with descriptive field names to improve extraction accuracy.
2. **Type Specificity**: Use appropriate JSON Schema types (string, number, boolean, array, object) and formats (date, email, uri) for better results.
3. **Include Field Descriptions**: Use Pydantic's `Field(description="...")` or Zod's `.describe()` to provide context.
4. **Parse Task Reuse**: When extracting multiple schemas from the same document, parse once and reference the task ID for efficiency.
5. **Citation Verification**: Use the provided citations to build audit trails and allow users to verify extracted data against source documents.
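The parse-task reuse tip can be sketched as a small helper that creates one extract task per schema against a single parse task ID. It relies only on the `client.tasks.extract.create(file=..., schema=...)` call shown earlier; the helper name and the schema labels are made up for this example.

```python
def extract_with_schemas(client, parse_task_id, schemas):
    """Create one extract task per schema, all reusing the same parse task."""
    return {
        name: client.tasks.extract.create(file=parse_task_id, schema=schema)
        for name, schema in schemas.items()
    }


# Assumed usage, with `client` created as in the earlier examples and two
# JSON schemas of your own:
# tasks = extract_with_schemas(
#     client,
#     "parse_task_id",
#     {"invoice": invoice_schema, "vendor": vendor_schema},
# )
```

Since every call references the same parse task, the document is parsed once regardless of how many schemas are applied.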
# Advanced Cases
Source: https://docs.chunkr.ai/pages/features/parse/advanced-cases
Configuring the Parse feature for advanced use cases
This guide covers advanced configurations for the Parse feature to handle a variety of specialized use cases and requirements.
## Extended Context: Handling Distant Legends
Elements like tables or charts might rely on context from other parts of the page. For example, a chart's legend could be located in a different corner of the document that isn't picked up when the cropped chart is sent to a VLM.
For these scenarios, the best practice is to enable `extended_context`. This provides the VLM with the full page image alongside the cropped segment as context.
Here’s how to enable it for `Table` and `Picture` segments:
```python Python theme={"system"}
import os
from chunkr_ai import Chunkr
client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])
# Parse with extended context for tables and pictures
task = client.tasks.parse.create(
file="https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/construction.pdf",
segment_processing={
"table": {"extended_context": True},
"picture": {"extended_context": True},
},
)
```
```typescript TypeScript theme={"system"}
import Chunkr from "chunkr-ai";
const client = new Chunkr({ apiKey: process.env.CHUNKR_API_KEY! });
// Parse with extended context for tables and pictures
const task = await client.tasks.parse.create({
file: "https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/construction.pdf",
segment_processing: {
Table: {
extended_context: true,
},
Picture: {
extended_context: true,
},
},
});
```
***
## Full-Page VLM: Bypassing Layout Analysis
For documents where layout analysis struggles, or for simple documents where it's unnecessary, you can bypass layout analysis entirely.
By setting the `segmentation_strategy` to `Page`, you can instruct Chunkr to process the entire page with a Vision Language Model (VLM) and generate Markdown directly.
This approach is highly effective for:
* **Layout analysis failure**: In the rare case that layout analysis struggles with a document's structure.
* **Simple Documents**: Tiny, text-only, and uniform documents (e.g., receipts) where layout analysis offers no benefit and simple OCR is sufficient for bounding boxes.
Here’s how to enable it:
```python Python theme={"system"}
import os
from chunkr_ai import Chunkr
client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])
# Force full-page VLM processing for Markdown output
task = client.tasks.parse.create(
file="https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/receipt.pdf",
segmentation_strategy="Page",
)
```
```typescript TypeScript theme={"system"}
import Chunkr from "chunkr-ai";
const client = new Chunkr({ apiKey: process.env.CHUNKR_API_KEY! });
// Force full-page VLM processing for Markdown output
const task = await client.tasks.parse.create({
file: "https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/receipt.pdf",
segmentation_strategy: "Page",
});
```
***
## Disabling Chunking for Non-RAG Workflows
If you're using Chunkr for data extraction, document analysis, or other non-RAG workflows, you may want to disable chunking entirely.
When chunking is disabled, each chunk in the output will contain exactly one segment.
To disable chunking, set `target_length` to `0` in the `chunk_processing` configuration:
```python Python theme={"system"}
import os
from chunkr_ai import Chunkr
client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])
# Disable chunking for extraction workflows
task = client.tasks.parse.create(
file="https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/receipt.pdf",
chunk_processing={
"target_length": 0 # Disables chunking
},
)
```
```typescript TypeScript theme={"system"}
import Chunkr from "chunkr-ai";
const client = new Chunkr({ apiKey: process.env.CHUNKR_API_KEY! });
// Disable chunking for extraction workflows
const task = await client.tasks.parse.create({
file: "https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/receipt.pdf",
chunk_processing: {
target_length: 0, // Disables chunking
},
});
```
***
## Optimizing for speed
The most significant factor affecting processing time is VLM processing. By default, Chunkr uses VLM processing for the following segment types to ensure high-quality data extraction:
* **Tables**
* **Images**
* **Forms**
* **Legends**
* **Formulas**
If high-quality data extraction is not critical for certain segment types in your use case, you can disable VLM processing for those segments to significantly improve processing speed.
For example, if your document contains images that are decorative or not essential to extract, you can disable VLM processing for images:
```python Python theme={"system"}
import os
from chunkr_ai import Chunkr
client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])
# Disable VLM processing for images to optimize for speed
task = client.tasks.parse.create(
file="https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/doc.pdf",
segment_processing={
"picture": {"strategy": "Auto"},
},
)
```
```typescript TypeScript theme={"system"}
import Chunkr from "chunkr-ai";
const client = new Chunkr({ apiKey: process.env.CHUNKR_API_KEY! });
// Disable VLM processing for images to optimize for speed
const task = await client.tasks.parse.create({
file: "https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/doc.pdf",
segment_processing: {
Picture: {
strategy: "Auto",
},
},
});
```
You can disable VLM processing for multiple segment types by adding them to the `segment_processing` configuration. This allows you to balance speed and quality based on your specific requirements.
***
## Extracting Text Styling
By default, text segments are processed with OCR, which captures the content but loses formatting information. If you need to preserve text styling such as bold, italics, font colors, and other formatting details, you can enable VLM processing for text segments.
This is useful for use cases like:
* **Redlining**: Tracking changes and formatting in legal documents
* **Document comparison**: Identifying styling differences between versions
* **Accessibility**: Preserving semantic meaning conveyed through formatting
Enabling VLM processing for text segments significantly increases processing time, as text segments are the most common segment type in documents.
Here's how to enable text styling extraction:
```python Python theme={"system"}
import os
from chunkr_ai import Chunkr
client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])
# Enable VLM processing for text to capture styling
task = client.tasks.parse.create(
file="https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/doc.pdf",
segment_processing={
"text": {"strategy": "LLM"},
},
)
```
```typescript TypeScript theme={"system"}
import Chunkr from "chunkr-ai";
const client = new Chunkr({ apiKey: process.env.CHUNKR_API_KEY! });
// Enable VLM processing for text to capture styling
const task = await client.tasks.parse.create({
file: "https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/doc.pdf",
segment_processing: {
Text: {
strategy: "LLM",
},
},
});
```
***
## Ignoring Segment Types
When you only need specific types of content from your documents, you can ignore certain segment types entirely. This is useful for:
* Focusing on specific content types (e.g., only tables and charts)
* Removing unwanted elements (e.g., headers, footers, page numbers)
* Simplifying output for targeted extraction workflows
For example, if you only want to extract tables and ignore all other content:
```python Python theme={"system"}
import os
from chunkr_ai import Chunkr
client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])
# Extract only tables, ignore everything else
task = client.tasks.parse.create(
file="https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/doc.pdf",
segment_processing={
"caption": {"strategy": "Ignore"},
"footnote": {"strategy": "Ignore"},
"form_region": {"strategy": "Ignore"},
"formula": {"strategy": "Ignore"},
"graphical_item": {"strategy": "Ignore"},
"legend": {"strategy": "Ignore"},
"line_number": {"strategy": "Ignore"},
"list_item": {"strategy": "Ignore"},
"page": {"strategy": "Ignore"},
"page_footer": {"strategy": "Ignore"},
"page_header": {"strategy": "Ignore"},
"page_number": {"strategy": "Ignore"},
"picture": {"strategy": "Ignore"},
"text": {"strategy": "Ignore"},
"title": {"strategy": "Ignore"},
},
)
```
```typescript TypeScript theme={"system"}
import Chunkr from "chunkr-ai";
const client = new Chunkr({ apiKey: process.env.CHUNKR_API_KEY! });
// Extract only tables, ignore everything else
const task = await client.tasks.parse.create({
file: "https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/doc.pdf",
segment_processing: {
Caption: { strategy: "Ignore" },
Footnote: { strategy: "Ignore" },
FormRegion: { strategy: "Ignore" },
Formula: { strategy: "Ignore" },
GraphicalItem: { strategy: "Ignore" },
Legend: { strategy: "Ignore" },
LineNumber: { strategy: "Ignore" },
ListItem: { strategy: "Ignore" },
Page: { strategy: "Ignore" },
PageFooter: { strategy: "Ignore" },
PageHeader: { strategy: "Ignore" },
PageNumber: { strategy: "Ignore" },
Picture: { strategy: "Ignore" },
Text: { strategy: "Ignore" },
Title: { strategy: "Ignore" },
},
});
```
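Writing out every segment type by hand is easy to get wrong, so the same configuration can be generated. The sketch below is one possible helper, not part of the SDK: `ignore_all_except` is a hypothetical name, and the keys are the snake_case ones used in the Python examples on this page.

```python
# Segment type keys as they appear in the Python examples on this page.
SEGMENT_TYPE_KEYS = [
    "caption", "footnote", "form_region", "formula", "graphical_item",
    "legend", "line_number", "list_item", "page", "page_footer",
    "page_header", "page_number", "picture", "table", "text", "title",
]


def ignore_all_except(keep):
    """Build a segment_processing dict that ignores every type not listed in `keep`."""
    keep = set(keep)
    unknown = keep - set(SEGMENT_TYPE_KEYS)
    if unknown:
        # Fail early on typos instead of silently ignoring the wrong segments.
        raise ValueError(f"Unknown segment type keys: {sorted(unknown)}")
    return {
        key: {"strategy": "Ignore"}
        for key in SEGMENT_TYPE_KEYS
        if key not in keep
    }


# Equivalent to the hand-written "tables only" configuration above.
segment_processing = ignore_all_except({"table"})
print(sorted(segment_processing))
```

The resulting dictionary can be passed as `segment_processing=segment_processing` to `client.tasks.parse.create(...)`.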
Alternatively, you can selectively ignore just a few segment types while keeping the rest. The following example removes headers and footers so they do not end up in RAG chunks:
```python Python theme={"system"}
import os
from chunkr_ai import Chunkr
client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])
# Ignore headers and footers
task = client.tasks.parse.create(
file="https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/doc.pdf",
segment_processing={
"page_header": {"strategy": "Ignore"},
"page_footer": {"strategy": "Ignore"},
},
)
```
```typescript TypeScript theme={"system"}
import Chunkr from "chunkr-ai";
const client = new Chunkr({ apiKey: process.env.CHUNKR_API_KEY! });
// Ignore headers and footers
const task = await client.tasks.parse.create({
file: "https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/doc.pdf",
segment_processing: {
PageHeader: { strategy: "Ignore" },
PageFooter: { strategy: "Ignore" },
},
});
```
# Parse Outputs
Source: https://docs.chunkr.ai/pages/features/parse/outputs
Understanding data returned by the Parse feature.
## High-Level Structure
```mermaid theme={"system"}
graph TD;
Task["Task
(task_id, status, ...)"] --> Output["Output Object"];
Output --> Metadata["Metadata
- file_name
- mime_type
- page_count
- pdf_url"];
Output --> Chunks["chunks [ ]"];
Output --> Pages["pages [ ]"];
Chunks --> Chunk["Chunk
- chunk_id
- chunk_length
- embed
- segments[]"];
Chunk --> Segment["Segment
- content
- description
- bbox
- ss_* fields
..."];
Pages --> Page["Page
- page_number
- image
- pg_width, pg_height
- dpi
- ss_sheet_name"];
```
Parse returns a `Task` object. When processing is successful, the `output` field contains the HTML/Markdown representation of your document.
The core of this output is a list of `chunks`, which are composed of individual `segments`.
```json Top-Level Output theme={"system"}
{
"task_id": "8b7e7e8a-...",
"status": "Succeeded",
"output": {
"file_name": "document.pdf",
"page_count": 2,
"chunks": [
// ... array of chunk objects
],
"pages": [
// ... array of page objects with full-page images
]
},
// ... other task metadata
}
```
### Chunks and Segments
The document is first broken down into `segments`, which represent individual semantic elements like a paragraph, table, or title. These segments are then grouped into `chunks` based on your chunking configuration.
* **Segments**: The smallest building blocks. Each segment corresponds to a single, identified element from the source document.
* **Chunks**: A logical grouping of one or more segments. Each chunk includes `chunk_length`, `content`, `embed`, and `segments[]`. For RAG applications, chunks are the units of information that are typically embedded and retrieved.
| Segment Type | Description |
| --------------- | ---------------------------------------------------------------------- |
| `Caption` | Descriptive text for images, tables, or figures |
| `Footnote` | Reference notes at the bottom of a page |
| `Formula` | Mathematical expressions and equations |
| `FormRegion` | Group of form fields and input areas |
| `GraphicalItem` | Small visual elements like logos, QR codes, barcodes, and stamps |
| `Legend` | Keys or legends for charts, graphs, and images |
| `LineNumber` | Line numbers in legal documents, patents, and technical specifications |
| `ListItem` | Bullet points or numbered list entries |
| `PageFooter` | Footer content at the bottom of a page |
| `PageHeader` | Header content at the top of a page |
| `PageNumber` | Page numbering text |
| `Picture` | Images, charts, and graphs |
| `Table` | Tabular data with rows and columns |
| `Text` | Regular paragraph text |
| `Title` | Document or section titles |
| `Unknown` | Unclassified content |
| `Page` | Full page content when layout analysis is disabled |
By default, no segments are ignored. To change this behaviour you can
[adjust your configuration to ignore specific segments like headers and footers for RAG](/pages/features/parse/advanced-cases#ignoring-segment-types).
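To see which segment types a document actually produced before deciding what to ignore, the chunk and segment hierarchy can be walked directly. This is a minimal sketch that assumes the output has been loaded as a plain dictionary (for example from the JSON response); the sample data is made up.

```python
from collections import Counter


def count_segment_types(output):
    """Count segments per segment_type across all chunks of a parse output."""
    return Counter(
        segment["segment_type"]
        for chunk in output.get("chunks", [])
        for segment in chunk.get("segments", [])
    )


# Made-up sample with the chunks -> segments shape described above.
sample_output = {
    "chunks": [
        {"segments": [{"segment_type": "Title"}, {"segment_type": "Text"}]},
        {"segments": [{"segment_type": "Table"}, {"segment_type": "Text"}]},
        {"segments": [{"segment_type": "PageFooter"}]},
    ]
}

counts = count_segment_types(sample_output)
for segment_type, count in counts.most_common():
    print(f"{segment_type}: {count}")
```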
***
## Key Output Fields
Each `segment` object contains rich information. At the `chunk` level, corresponding fields are concatenated from all of the segments within that chunk. Here are the most important fields:
### 1. `content`
The `content` field holds the primary, structured representation of the segment. Each segment is formatted based on its type.
* **Tables**: Converted to HTML to maintain complex col/row-span structure.
* **Images**: Converted to a robust markdown description, with charts/graphs including a tabular representation.
* **Forms**: Converted to HTML for structured key-value representation.
* **Legends**: Converted into a key-value markdown table.
* **Formulas**: Converted to LaTeX strings for perfect mathematical representation. They can even be embedded within an HTML table if a formula appears inside a cell.
* **GraphicalItems**: Simple OCR extraction of any text present.
* **Text-type**: Text-heavy segments like title, section headers, list-items and text blocks are converted into markdown.
```json Segment with HTML Table and LaTeX theme={"system"}
{
"segment_type": "Table",
"content": "<table><tr><td>The formula is:</td><td>\\( E=mc^2 \\)</td></tr></table>"
}
```
These are the default conversion formats for each segment type. You can adjust this behavior via [segment processing controls](/api-references/tasks/create-parse-task#body-segment-processing).
### 2. `embed`
The `embed` field provides the clean, RAG-optimized text that should be used for generating embeddings.
* It includes the `content` and, if present, the `description`. This makes segments such as tables easier to retrieve without contaminating the `content` field.
* This is the field used for calculating token counts when chunking, ensuring chunks fit your target length.
```json Chunk with embed field theme={"system"}
{
"chunk_id": "chunk-1-...",
"chunk_length": 45,
"embed": "The table shows a 15% increase in Q2 revenue for Widget A... | Product | Q1 | Q2 | ...",
"segments": [ /* ... */ ]
}
```
You can customize the tokenizer and chunk length via [chunk processing controls](/api-references/tasks/create-parse-task#body-chunk-processing).
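In a RAG pipeline this usually means sending `embed` to the embedding model while keeping `content` for display. The sketch below shows one way to prepare such records; it assumes the output is available as a plain dictionary, and the record keys (`id`, `text_to_embed`, `display_content`) are hypothetical names chosen for this example.

```python
def embedding_records(output):
    """Build one record per chunk: `embed` is the text to embed, `content` is kept for display."""
    records = []
    for chunk in output.get("chunks", []):
        text = chunk.get("embed")
        if not text:
            continue  # nothing to embed for this chunk
        records.append(
            {
                "id": chunk["chunk_id"],
                "text_to_embed": text,
                "display_content": chunk.get("content"),
            }
        )
    return records


# Made-up sample shaped like the chunk example above.
sample_output = {
    "chunks": [
        {
            "chunk_id": "chunk-1",
            "chunk_length": 45,
            "embed": "The table shows a 15% increase in Q2 revenue for Widget A",
            "content": "Q2 revenue table",
            "segments": [],
        },
        {"chunk_id": "chunk-2", "chunk_length": 0, "embed": "", "segments": []},
    ]
}

records = embedding_records(sample_output)
print(len(records), "chunk(s) ready for embedding")
```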
### 3. `bbox` (Bounding Box)
Every segment includes a precise bounding box (`bbox`) that pinpoints its exact location on the original page. This is essential for building applications that require citations or highlighting.
* The coordinates (`left`, `top`, `width`, `height`) are pixel (`px`) values in the page coordinate space. For resolution‑independent rendering, normalize them using the page dimensions — for example, `left_pct = left / page_width`, `top_pct = top / page_height`, and likewise for `width`/`height`. Use `page_width`/`page_height` on the segment (or the page's `pg_width`/`pg_height`).
* The `dpi` in the `pages` array describes the pixel resolution of the rendered page image. You do not need `dpi` when using normalized percentages; it is helpful only when mapping directly to a specific raster image in pixels or when generating images at a different scale.
```json Segment with Bounding Box theme={"system"}
{
"segment_type": "Text",
"bbox": { "left": 100, "top": 250, "width": 500, "height": 50 },
"page_number": 1,
"page_height": 1584,
"page_width": 1224,
...
}
```
```json Page with DPI theme={"system"}
{
"dpi": 144,
"image": "https://chunkr.ai/page_1.jpg",
"page_number": 1,
"page_height": 1584,
"page_width": 1224,
...
}
```
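The normalization described above can be sketched as follows. The helper names `normalize_bbox` and `scale_bbox` are hypothetical, the segment is treated as a plain dictionary, and the values are the ones from the segment example above.

```python
def normalize_bbox(segment):
    """Convert a segment's pixel bbox to fractions of the page size (0.0 to 1.0)."""
    bbox = segment["bbox"]
    page_width = segment["page_width"]
    page_height = segment["page_height"]
    return {
        "left": bbox["left"] / page_width,
        "top": bbox["top"] / page_height,
        "width": bbox["width"] / page_width,
        "height": bbox["height"] / page_height,
    }


def scale_bbox(normalized, target_width, target_height):
    """Map a normalized bbox onto a page image rendered at an arbitrary size."""
    return {
        "left": normalized["left"] * target_width,
        "top": normalized["top"] * target_height,
        "width": normalized["width"] * target_width,
        "height": normalized["height"] * target_height,
    }


segment = {
    "segment_type": "Text",
    "bbox": {"left": 100, "top": 250, "width": 500, "height": 50},
    "page_number": 1,
    "page_height": 1584,
    "page_width": 1224,
}

normalized = normalize_bbox(segment)
# Place the highlight on a viewer that renders the page at half size.
overlay = scale_bbox(normalized, target_width=612, target_height=792)
print(normalized)
print(overlay)
```

Storing the normalized values keeps highlights correct regardless of the zoom level or image resolution the viewer uses.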
***
## Spreadsheet-Specific Outputs (`ss_*`) Preview
This feature is in preview. Occasionally, extremely large spreadsheets can
fail. In that case, we still return HTML, but layout analysis is not
performed.
When processing spreadsheet files (`.xlsx`, `.xls`), the output includes additional `ss_*` prefixed fields that provide native Excel context. These fields exist alongside the standard `content`, `embed`, and `bbox` fields, enriching each segment with its precise location and native data from the original spreadsheet.
### Key Fields
* **`ss_range`**: The cell range for the segment in A1 notation (e.g., `A1:D10`).
* **`ss_cells`**: A detailed array of each cell in the segment, including its original formula, value, text and styling. Allows you to see both the raw formula (`=SUM(B2:B10)`) and its calculated result (`$55,000`).
* **`ss_header_*`**: Fields identifying the detected header for a table, such as `ss_header_range`, `ss_header_text`, `ss_header_bbox`, and `ss_header_ocr`. Headers are intelligently associated even if they are not directly adjacent to the table.
These spreadsheet-native values unlock powerful capabilities:
* **Create Interactive Experiences**: Use `ss_range` to build native citation experiences that let users click data and jump to the precise source cell in a viewer.
* **Get Cleaner LLM Context**: Combine layout analysis with precise cell data to identify tables, associate headers, and filter out irrelevant cells. This provides cleaner, more meaningful context for LLM processing.
* **Build Powerful Spreadsheet Agents**: Use the `ss_*` fields to build AI agents that can read, analyze, and even **write** back to spreadsheets. Understanding cell formulas and values enables agents to automate tasks like updating financial models, correcting entries, or adding new rows.
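For citation or agent use cases, `ss_range` usually needs to be turned into row and column indices. The following is a minimal sketch that assumes the value is a plain A1 range such as `A1:D10` (or a single cell) with no sheet-name prefix; `parse_a1_range` and `column_index` are hypothetical helper names.

```python
import re

# Matches "A1" or "A1:D10"; sheet-qualified ranges are deliberately not handled.
_A1_RANGE = re.compile(r"^([A-Za-z]+)(\d+)(?::([A-Za-z]+)(\d+))?$")


def column_index(letters):
    """Convert column letters to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    index = 0
    for char in letters.upper():
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index


def parse_a1_range(ss_range):
    """Return (first_row, first_col, last_row, last_col), all 1-based and inclusive."""
    match = _A1_RANGE.match(ss_range.replace("$", ""))
    if match is None:
        raise ValueError(f"Not a plain A1 range: {ss_range!r}")
    start_col, start_row, end_col, end_row = match.groups()
    # A single cell is treated as a one-cell range.
    end_col = end_col or start_col
    end_row = end_row or start_row
    return (
        int(start_row),
        column_index(start_col),
        int(end_row),
        column_index(end_col),
    )


first_row, first_col, last_row, last_col = parse_a1_range("A1:D10")
print(f"rows {first_row}-{last_row}, columns {first_col}-{last_col}")
```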
***
Looking for additional output fields? See the Advanced Outputs section below
for more metadata options.
Beyond the key fields discussed above, the Parse output is enriched with a variety of other useful metadata at the file, page, and segment levels. Here are some of the most valuable advanced fields:
* **Word-Level Bounding Boxes**: Included in the `ocr` array for each page, this provides the precise coordinates for every single word detected by the OCR process. This is ideal for building applications that require highlighting specific words or phrases in a document viewer.
* **Cropped Segment Images**: Each segment object contains an `image` field with a URL to a cropped image of just that segment. This is incredibly useful for providing visual context to an LLM or displaying the source of a specific chunk of text.
* **File & Page Metadata**: The top-level `output` object contains file-level metadata like the original `file_name`, `mime_type`, and `page_count`. Additionally, the `pages` array contains detailed information for each page, including a full-page image URL, dimensions (`page_width`, `page_height`), and DPI.
For a comprehensive breakdown of every field available in the output, please refer to our [API Reference](/api-references/tasks/get-task#response-output).
# Parse Overview
Source: https://docs.chunkr.ai/pages/features/parse/overview
Convert documents into LLM-ready data
The Parse feature transforms complex documents into machine-readable data, optimized for LLMs.
It intelligently identifies document elements, processes them based on their type, and outputs clean HTML & Markdown content ready for AI applications and downstream workflow automation.
## Key Features
* [**Perfect Markdown & HTML**](/pages/features/parse/outputs#1-content): LLM-ready content (Markdown, HTML, tables, etc).
* **Reading order intact**: Maintains the natural reading flow for complex layouts.
* [**Granular bounding boxes**](/pages/features/parse/outputs#3-bbox-bounding-box): Pinpoints element coordinates with precision for easy citations.
* [**Native Spreadsheet handling**](/pages/features/parse/outputs#spreadsheet-specific-outputs-ss): 100% reconstruction with formulas, styling, and cell values preserved; precise ranges; cleans tables and converts charts to structured data.
* [**Post-processing**](/pages/features/parse/outputs#5-post-processing): Token-aware chunking, cropped images, and more.
## Example: Parse and access chunk content
Here's how you can parse a document and access its chunks using our SDKs.
```python Python theme={"system"}
import os
import time
from chunkr_ai import Chunkr
client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])
# Parse a document from URL
url = "https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/doc.pdf"
task = client.tasks.parse.create(file=url)
# OR parse from local file
with open("path/to/doc.pdf", "rb") as f:
    file = client.files.create(file=f)
    task = client.tasks.parse.create(file=file.url)
print(f"Task created with ID: {task.task_id}")
# Wait for the task to complete
while not task.completed:
    task = client.tasks.parse.get(task_id=task.task_id)
    print(f"Task {task.task_id} is {task.status}")
    time.sleep(3)
# Access the chunks from the output
if task.status == "Succeeded" and task.output is not None:
    for chunk in task.output.chunks:
        print(chunk.content)
else:  # Could be "Failed" or "Cancelled"
    print(f"Task status: {task.status}")
```
```typescript TypeScript theme={"system"}
import Chunkr from "chunkr-ai";
import fs from "fs";
const client = new Chunkr({ apiKey: process.env.CHUNKR_API_KEY! });
// Parse a document from URL
const url = "https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/doc.pdf";
let task = await client.tasks.parse.create({ file: url });
// OR parse from local file
const fileStream = fs.createReadStream("path/to/doc.pdf");
const uploadedFile = await client.files.create({
file: fileStream,
file_metadata: JSON.stringify({
name: "doc.pdf",
type: "application/pdf",
}),
});
task = await client.tasks.parse.create({
file: uploadedFile.url,
});
console.log(`Task created with ID: ${task.task_id}`);
// Wait for the task to complete
while (!task.completed) {
task = await client.tasks.parse.get(task.task_id);
console.log(`Task ${task.task_id} is ${task.status}`);
await new Promise((resolve) => setTimeout(resolve, 3000));
}
// Access the chunks from the output
if (task.status === "Succeeded") {
for (const chunk of task.output?.chunks || []) {
console.log(chunk.content);
}
} else { // Could be "Failed" or "Cancelled"
console.log(`Task Status: ${task.status}`);
}
```
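The polling loops above run until the task reports completion. In production code it is often safer to bound the wait. The sketch below is a generic helper, not an SDK feature: `wait_for` and its parameters are hypothetical names, and the default timeout and interval are arbitrary.

```python
import time


def wait_for(fetch, is_done, timeout=300.0, interval=3.0,
             sleep=time.sleep, clock=time.monotonic):
    """Call `fetch` repeatedly until `is_done(result)` is true or `timeout` seconds pass."""
    deadline = clock() + timeout
    result = fetch()
    while not is_done(result):
        if clock() >= deadline:
            raise TimeoutError(f"Task did not complete within {timeout} seconds")
        sleep(interval)
        result = fetch()
    return result


# Usage with the client from the example above (not executed here):
# task_id = task.task_id
# task = wait_for(
#     fetch=lambda: client.tasks.parse.get(task_id=task_id),
#     is_done=lambda t: t.completed,
#     timeout=600,
# )
```

Passing `sleep` and `clock` as parameters keeps the helper easy to test without real delays.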
***
Our default configuration is optimized through extensive testing and provides
excellent results for most documents. You can customize parse if you have
specific requirements.
For a comprehensive breakdown of every available configuration, please refer to our [API Reference](/api-references/tasks/create-task). Here is an overview of our configuration options:
* **Pipeline (`pipeline`)**: Choose the provider (`Azure` or `Chunkr`) for layout analysis and OCR models.
* **Layout Analysis & OCR**:
* *Segmentation Strategy (`segmentation_strategy`)*: Choose between `LayoutAnalysis` (default) or a full-page VLM approach for parsing.
* *OCR Strategy (`ocr_strategy`)*: Use `Auto` to selectively apply OCR or `All` to force it on every page.
* **Segment-level Customization (`segment_processing`)**: Control processing for each document element (e.g., `Text`, `Table`, `Picture`):
* *Processing Strategy (`strategy`)*: For each segment, set the strategy to generate HTML/Markdown. `Auto` (simple OCR + logic), `LLM` (VLM generation), or `Ignore` (remove from output).
* *Format Control (`format`)*: Control the output format (`Markdown` or `HTML`) for segment content.
* *Extended Context (`extended_context`)*: Provide the full page image as additional context for VLM processing of a segment. Useful for cases like distant legends for tables and pictures.
* *Cropped Images (`crop_image`)*: Control if a cropped image of the segment is included.
* **Chunking (`chunk_processing`)**: Configure chunking strategy, sizes, and token-counting model.
* **Error Handling (`error_handling`)**: Set to `Fail` (default) to stop on any error, or `Continue` to process despite non-critical errors.
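Put together, a customized request might look roughly like the sketch below. It only builds a dictionary of options. Passing `pipeline`, `ocr_strategy`, and `error_handling` as top-level keyword arguments is an assumption based on how `segment_processing`, `segmentation_strategy`, and `chunk_processing` are passed in the SDK examples, and the `target_length` of 512 is an arbitrary illustrative value.

```python
# Illustrative option values drawn from the overview above; adjust to your documents.
parse_options = {
    "pipeline": "Chunkr",  # assumption: accepted as a top-level argument
    "segmentation_strategy": "LayoutAnalysis",
    "ocr_strategy": "Auto",  # assumption: accepted as a top-level argument
    "segment_processing": {
        # Keys follow the snake_case style of the Python SDK examples.
        "table": {"strategy": "LLM", "format": "HTML", "extended_context": True},
        "picture": {"strategy": "LLM", "format": "Markdown"},
        "page_header": {"strategy": "Ignore"},
        "page_footer": {"strategy": "Ignore"},
    },
    "chunk_processing": {"target_length": 512},  # arbitrary example length
    "error_handling": "Continue",  # assumption: accepted as a top-level argument
}

# With a configured client, the options would be unpacked into the call:
# task = client.tasks.parse.create(file=url, **parse_options)
print(sorted(parse_options))
```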
# Supported File types
Source: https://docs.chunkr.ai/pages/get-started/file-types
A comprehensive list of all compatible file types
Below is a list of the most commonly used file formats. For the complete list of all supported file types and MIME types, visit [api.chunkr.ai/file-types](https://api.chunkr.ai/file-types).
## Documents
| File Type | Extension | MIME Type |
| :---------------------- | :-------- | :------------------------------------------------------------------------ |
| Adobe PDF | `.pdf` | `application/pdf` |
| Microsoft Word | `.docx` | `application/vnd.openxmlformats-officedocument.wordprocessingml.document` |
| Microsoft Word (Legacy) | `.doc` | `application/msword` |
## Presentations
| File Type | Extension | MIME Type |
| :---------------------------- | :-------- | :-------------------------------------------------------------------------- |
| Microsoft PowerPoint | `.pptx` | `application/vnd.openxmlformats-officedocument.presentationml.presentation` |
| Microsoft PowerPoint (Legacy) | `.ppt` | `application/vnd.ms-powerpoint` |
## Spreadsheets
| File Type | Extension | MIME Type |
| :----------------------- | :-------- | :------------------------------------------------------------------ |
| Microsoft Excel | `.xlsx` | `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` |
| Microsoft Excel (Legacy) | `.xls` | `application/vnd.ms-excel` |
## Images
| File Type | Extension(s) | MIME Type(s) |
| :-------- | :-------------- | :------------------------ |
| JPEG | `.jpg`, `.jpeg` | `image/jpeg`, `image/jpg` |
| PNG | `.png` | `image/png` |
| GIF | `.gif` | `image/gif` |
| WebP | `.webp` | `image/webp` |
| TIFF | `.tiff` | `image/tiff` |
| BMP | `.bmp` | `image/bmp` |
| SVG | `.svg` | `image/svg` |
| HEIC | `.heic` | `image/heic` |
| HEIF | `.heif` | `image/heif` |
| AVIF | `.avif` | `image/avif` |
## Text
| File Type | Extension(s) | MIME Type(s) |
| :--------- | :------------------------- | :-------------- |
| Plain Text | `.txt` | `text/plain` |
| HTML | `.html`, `.htm` | `text/html` |
| Markdown | `.md`, `.markdown`, `.mkd` | `text/markdown` |
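When uploading local files it can help to check the extension against these tables first. The sketch below covers only a subset of the entries listed above (where several MIME types are listed, the first is used), and `mime_type_for` is a hypothetical helper name; the endpoint mentioned at the top of this page remains the authoritative list.

```python
from pathlib import Path

# Subset of the extension-to-MIME pairs from the tables above.
SUPPORTED_MIME_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".doc": "application/msword",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".ppt": "application/vnd.ms-powerpoint",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".xls": "application/vnd.ms-excel",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".txt": "text/plain",
    ".html": "text/html",
    ".htm": "text/html",
    ".md": "text/markdown",
    ".markdown": "text/markdown",
    ".mkd": "text/markdown",
}


def mime_type_for(path):
    """Return the listed MIME type for a path, or None if the extension is not in the subset."""
    return SUPPORTED_MIME_TYPES.get(Path(path).suffix.lower())


for name in ["Report.PDF", "notes.markdown", "archive.zip"]:
    print(name, "->", mime_type_for(name))
```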
***
## Domains
Chunkr is designed to process a wide variety of documents across numerous domains. Here are some examples:
| Domain | Examples |
| :---------------- | :------------------------------------------------------------------------------ |
| **Medical** | Patient records, charts, prescriptions, hospital forms, EOBs |
| **Real Estate** | Property listings, deeds, appraisal reports, lease agreements, MLS sheets |
| **Education** | Homework, exams, syllabi, lecture notes, worksheets |
| **Construction** | Blueprints, architectural drawings, plans, permits, inspection reports |
| **Financial** | Annual reports, SEC filings, bank statements, loan applications, prospectuses |
| **Billing** | Invoices, POs, receipts, utility bills, account statements |
| **Tax** | Tax forms (W2, 1040), returns, official documents |
| **Supply Chain** | Bills of lading, packing slips, manifests, inventory reports, POs, PODs |
| **Procurement** | RFPs, RFIs, RFQs, bids, proposals, SOWs |
| **Legal** | Contracts, court filings, briefs, NDAs, ToS, deeds, wills |
| **Government** | Official forms, regulations, public notices, legislative docs, census forms |
| **Technical** | Manuals, specifications, datasheets, engineering drawings, code docs |
| **Research** | Academic papers, scientific articles, theses, study reports (excluding patents) |
| **Patent** | Official patent filings/grants with abstract, claims, drawings |
| **Consulting** | Presentations, proposals, reports, case studies |
| **Magazine** | Articles, multi-column layouts, image-heavy content |
| **Newspaper** | News articles, editorials, classifieds |
| **Textbook** | Educational chapters with diagrams, exercises, specific formatting |
| **Historical** | Archived documents, letters, manuscripts, old records |
| **Miscellaneous** | ID cards, resumes, certificates, flyers, brochures, menus |
## Request a File Type/Domain
If you don't see a file type you need, please [send us an email](mailto:support@chunkr.ai?subject=File%20Type%20Request) to request it. We are always looking to expand our supported formats.
# LLM Documentation
Source: https://docs.chunkr.ai/pages/get-started/llm-docs
LLM-ready dev documentation for Chunkr AI
## Available Formats
We offer two primary formats for LLMs:
* **Condensed Documentation**: [https://docs.chunkr.ai/llms.txt](https://docs.chunkr.ai/llms.txt)
Streamlined version optimized for quick reference by LLMs. These are also helpful for MCP servers.
* **Full Documentation**: [https://docs.chunkr.ai/llms-full.txt](https://docs.chunkr.ai/llms-full.txt)
Complete documentation with all details and examples. Can be dumped directly into context.
## How to Use
[Here](https://youtu.be/fk2WEVZfheI) is a helpful video on how to integrate llms.txt and MCP servers.
# Developer Quickstart
Source: https://docs.chunkr.ai/pages/get-started/quickstart
Get started in 2 minutes
Follow these steps to set up your account and integrate with our API.
1. Visit [Chunkr AI](https://chunkr.ai)
2. Click on "Login" and create your account
3. Once logged in, navigate to "API Keys" in the dashboard
The Python SDK is currently in alpha. The `--pre` flag is required to install pre-release versions.
```bash Python theme={"system"}
pip install --pre chunkr-ai
```
```bash TypeScript theme={"system"}
npm install chunkr-ai
```
```python Python theme={"system"}
import os
import time
from chunkr_ai import Chunkr
from pydantic import BaseModel
# Initialize the client
client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])
# Create a parse task using a file URL
parse_task = client.tasks.parse.create(
file="https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/invoice.pdf"
)
# Alternatively, upload a local file first
# with open('path/to/doc.pdf', 'rb') as f:
# uploaded = client.files.create(file=f)
# parse_task = client.tasks.parse.create(file=uploaded.url)
# Wait for parse task to complete
while not parse_task.completed:
    parse_task = client.tasks.parse.get(task_id=parse_task.task_id)
    print(f"Parse Status: {parse_task.status}")
    time.sleep(3)
if parse_task.status == "Succeeded":
    # Do something with the output
    pass
else:  # Could be "Failed" or "Cancelled"
    print(f"Parse Status: {parse_task.status}")
```
```typescript TypeScript theme={"system"}
import Chunkr from "chunkr-ai";
import fs from "fs";
import * as z from "zod";
// Initialize the client
const client = new Chunkr({ apiKey: process.env.CHUNKR_API_KEY! });
// Create a parse task using a file URL
let parseTask = await client.tasks.parse.create({
  file: "https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/invoice.pdf",
});
// Alternatively, upload a local file first
// const fileStream = fs.createReadStream("path/to/doc.pdf");
// const uploaded = await client.files.create({
//   file: fileStream,
//   file_metadata: JSON.stringify({ name: "doc.pdf", type: "application/pdf" }),
// });
// let parseTask = await client.tasks.parse.create({ file: uploaded.url });
// Wait for parse task to complete
while (!parseTask.completed) {
  parseTask = await client.tasks.parse.get(parseTask.task_id);
  console.log(`Parse Status: ${parseTask.status}`);
  await new Promise((resolve) => setTimeout(resolve, 3000));
}
if (parseTask.status === "Succeeded") {
  // Do something with the output
} else { // Could be "Failed" or "Cancelled"
  console.log(`Parse Status: ${parseTask.status}`);
}
```
```python Python theme={"system"}
class Invoice(BaseModel):
    invoice_number: str
    invoice_date: str
    total_amount: float
# Use the parse task ID to create an extract task
extract_task = client.tasks.extract.create(
    file=parse_task.task_id,
    schema=Invoice.model_json_schema()  # Convert Pydantic model to JSON schema
)
# Wait for extract task to complete
while not extract_task.completed:
    extract_task = client.tasks.extract.get(task_id=extract_task.task_id)
    print(f"Extract Status: {extract_task.status}")
    time.sleep(3)
```
```typescript TypeScript theme={"system"}
const Invoice = z.object({
  invoice_number: z.string().min(1),
  invoice_date: z.string().min(1),
  total_amount: z.number().min(1),
});
let extractTask = await client.tasks.extract.create({
  file: parseTask.task_id,
  schema: z.toJSONSchema(Invoice), // Convert Zod schema to JSON schema
});
// Wait for extract task to complete
while (!extractTask.completed) {
  extractTask = await client.tasks.extract.get(extractTask.task_id);
  console.log(`Extract Status: ${extractTask.status}`);
  await new Promise((resolve) => setTimeout(resolve, 3000));
}
```
```python Python theme={"system"}
# Get parse results and print first 5 chunk contents
if parse_task.output is not None:
    for chunk in parse_task.output.chunks[:5]:
        if chunk.content is not None:
            print(chunk.content[:200])
# Get extract results and print schema fields
if extract_task.status == "Succeeded" and extract_task.output is not None:
    # Validate the results against the schema
    invoice = Invoice.model_validate(extract_task.output.results)
    # Do something with the invoice
    print(invoice)
```
```typescript TypeScript theme={"system"}
// Get parse results and print first 5 chunk contents
if (parseTask.output) {
  for (const chunk of parseTask.output.chunks.slice(0, 5)) {
    console.log(chunk.content?.slice(0, 200));
  }
}
// Get extract results and print schema fields
if (extractTask.status === "Succeeded" && extractTask.output) {
  const invoice = Invoice.parse(extractTask.output.results);
  // Do something with the invoice
  console.log(invoice);
}
```
You can also explore the output [through our web interface](/pages/get-started/web-interface) in more detail.
```python Python theme={"system"}
import os
import time
from chunkr_ai import Chunkr
from pydantic import BaseModel
# Initialize the client
client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])
# Create a parse task using a file URL
parse_task = client.tasks.parse.create(
    file="https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/invoice.pdf"
)
# Alternatively, upload a local file first
# with open('path/to/doc.pdf', 'rb') as f:
#     uploaded = client.files.create(file=f)
# parse_task = client.tasks.parse.create(file=uploaded.url)
# Wait for parse task to complete
while not parse_task.completed:
    parse_task = client.tasks.parse.get(task_id=parse_task.task_id)
    print(f"Parse Status: {parse_task.status}")
    time.sleep(3)
if parse_task.status == "Succeeded":
    # Do something with the output
    pass
else:  # Could be "Failed" or "Cancelled"
    print(f"Parse Status: {parse_task.status}")
class Invoice(BaseModel):
    invoice_number: str
    invoice_date: str
    total_amount: float
extract_task = client.tasks.extract.create(
    file=parse_task.task_id,
    schema=Invoice.model_json_schema()  # Convert Pydantic model to JSON schema
)
# Wait for extract task to complete
while not extract_task.completed:
    extract_task = client.tasks.extract.get(task_id=extract_task.task_id)
    print(f"Extract Status: {extract_task.status}")
    time.sleep(3)
# Get parse results and print first 5 chunk contents
if parse_task.output is not None:
    for chunk in parse_task.output.chunks[:5]:
        if chunk.content is not None:
            print(chunk.content[:200])
# Get extract results and print schema fields
if extract_task.status == "Succeeded" and extract_task.output is not None:
    invoice = Invoice.model_validate(extract_task.output.results)
    # Do something with the invoice
    print(invoice)
```
```typescript TypeScript theme={"system"}
import Chunkr from "chunkr-ai";
import fs from "fs";
import * as z from "zod";
// Initialize the client
const client = new Chunkr({ apiKey: process.env.CHUNKR_API_KEY! });
// Create a parse task using a file URL
let parseTask = await client.tasks.parse.create({
  file: "https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/invoice.pdf",
});
// Alternatively, upload a local file first
// const fileStream = fs.createReadStream("path/to/doc.pdf");
// const uploaded = await client.files.create({
//   file: fileStream,
//   file_metadata: JSON.stringify({ name: "doc.pdf", type: "application/pdf" }),
// });
// let parseTask = await client.tasks.parse.create({ file: uploaded.url });
// Wait for parse task to complete
while (!parseTask.completed) {
  parseTask = await client.tasks.parse.get(parseTask.task_id);
  console.log(`Parse Status: ${parseTask.status}`);
  await new Promise((resolve) => setTimeout(resolve, 3000));
}
if (parseTask.status === "Succeeded") {
  // Do something with the output
} else { // Could be "Failed" or "Cancelled"
  console.log(`Parse Status: ${parseTask.status}`);
}
const Invoice = z.object({
  invoice_number: z.string().min(1),
  invoice_date: z.string().min(1),
  total_amount: z.number().min(1),
});
let extractTask = await client.tasks.extract.create({
  file: parseTask.task_id,
  schema: z.toJSONSchema(Invoice), // Convert Zod schema to JSON schema
});
// Wait for extract task to complete
while (!extractTask.completed) {
  extractTask = await client.tasks.extract.get(extractTask.task_id);
  console.log(`Extract Status: ${extractTask.status}`);
  await new Promise((resolve) => setTimeout(resolve, 3000));
}
// Get parse results and print first 5 chunk contents
if (parseTask.output) {
  for (const chunk of parseTask.output.chunks.slice(0, 5)) {
    console.log(chunk.content?.slice(0, 200));
  }
}
// Get extract results and print schema fields
if (extractTask.status === "Succeeded" && extractTask.output) {
  const invoice = Invoice.parse(extractTask.output.results);
  // Do something with the invoice
  console.log(invoice);
}
```
## Next Steps
Get to production with our task system and webhooks.
Learn how to handle tasks in production.
Receive real-time notifications.
# Web Interface
Source: https://docs.chunkr.ai/pages/get-started/web-interface
Create tasks and visually inspect output quality without writing any code
The Chunkr web interface provides an intuitive way to test document processing and evaluate output quality - all without writing a single line of code.
Use the dashboard to visually inspect parsing accuracy, verify extraction results, and see exactly how your documents are being processed.
## Creating and Viewing Tasks
Follow these steps to create an extract task and explore the interactive viewers.
When you first visit the [Chunkr dashboard](https://app.chunkr.ai), you'll see the **Tasks** tab with welcome text and a call-to-action. This is where all your tasks - whether created via the UI or API - will appear.
Click the **Create Task** button to get started.
Walk through the task creation workflow to extract structured data from your document:
Click **Process Documents** to begin processing. You'll be automatically redirected to the **Tasks** tab where a task table appears showing:
* **Two tasks in Processing state**: a Parse task (for document segmentation) and an Extract task (for structured data extraction)
* Chunkr automatically creates both tasks when you submit an Extract request
Once processing completes, click any task row in the table to open its purpose-built viewer and inspect the results.
## Additional Features
The web interface also provides tools to manage your account:
* **Manage API keys**
* **Usage**: Monitor your document processing usage and limits
* **Billing**: Manage your subscription and payment methods
## Why use this?
The Chunkr dashboard is perfect for:
* **Testing before integration**: Try different documents/tasks and see results immediately
* **Quality assurance**: Visually verify parsing and extraction accuracy
* **Troubleshooting**: Identify processing issues
## Next Steps
Get started in under 2 minutes
Powerful Python and TypeScript libraries
Turn any document into LLM-ready data. Markdown, bounding boxes, etc.
Auto-fill custom schemas. Citations, confidence scores, structured JSON.
# Welcome to Chunkr
Source: https://docs.chunkr.ai/pages/get-started/welcome
Complex, messy documents to high-quality data
Chunkr turns complex documents like PDFs, spreadsheets, and images into clean data - fast, accurate, and at scale.
We build industry-leading VLMs and computer-vision models to deliver structured, machine-readable outputs with unmatched accuracy.
This guide contains everything you need to understand Chunkr. If anything is missing,
we're here to help at [support@chunkr.ai](mailto:support@chunkr.ai).
## Get Started
Get started in under 2 minutes
Powerful Python and TypeScript libraries
Test documents and view results instantly
***
## Features
A simple, task-based API that gives you full control over your document
ingestion.
Turn any document into LLM-ready data. Markdown, bounding boxes, etc.
Auto-fill custom schemas. Citations, confidence scores, structured JSON.
***
## What can I do with Chunkr?
Chunkr is built for any AI and developer team that works with messy documents
at scale.
You can process a [vast array of document types](/pages/get-started/file-types) across any industry. Use cases are wide and varied; here are some of the things folks build with our outputs:
### Standout AI Applications
* **Intelligent RAG systems**: Feed your Retrieval-Augmented Generation pipelines with perfectly chunked, application-ready content from any document.
* **Power document-first applications**: Leverage bounding boxes, citations, and precise OCR to build visual search tools, verification interfaces, and interactive document experiences.
* **Create specialized AI agents**: Develop sophisticated agents that can reason over and extract insights from legal contracts, financial reports, spreadsheets, or scientific papers.
### Automate Critical Workflows
* **Finance**: Automate data entry and accelerate financial analysis by processing high volumes of invoices, bank statements, and 10-K/10-Q reports.
* **Legal**: Streamline compliance and legal review by automating data extraction from regulatory filings, contracts, and evidence documents with fully auditable, citation-backed results.
* **Supply Chain**: Digitize and process bills of lading, packing slips, and purchase orders to enhance logistics, reduce manual errors, and speed up your supply chain.
***
## Security and Trust
Security is at the core of our platform. We offer a SOC 2 and HIPAA-compliant service, never train on your data, and provide on-premise solutions for maximum control. We also maintain backwards compatibility to ensure a stable, reliable platform you can depend on.
Explore our comprehensive security policies and commitment to data privacy.
Learn about our on-premise offerings for maximum data control and security.
# Deploy Chunkr on your infrastructure
Source: https://docs.chunkr.ai/pages/security/on-premise
Deploy Chunkr in your own infrastructure for maximum data control, compliance, and security.
Perfect for enterprises with strict data governance requirements.
## Why On-Premise?
* **Data Sovereignty**: Your documents never leave your infrastructure
* **Compliance**: Meet regulatory requirements
* **Custom Security**: Integrate with your existing security stack
## Quick Start
Contact our sales team to get started with on-premise deployment:
* **Email**: [support@chunkr.ai](mailto:support@chunkr.ai)
* **Support**: Dedicated support throughout deployment and operation
## What's Included
**Everything you need to get started**:
* Docker container for all Chunkr services ready for deployment
* License for running Chunkr within your infrastructure
* Continuous updates to keep your deployment current
* Direct access to our support engineers
**Available add-ons**:
* Kubernetes orchestration with Helm charts
* Simplified deployment through Docker Compose
* Production-ready monitoring and observability tools
## System Requirements
The exact system requirements depend on your volume and throughput requirements.
* **Minimum**: 8 CPU cores, 16GB RAM, 1TB SSD
* **GPU**: Optional but recommended for higher throughput
* **Network**: Outbound HTTPS for model updates (can be air-gapped)
# Security & Trust at Chunkr
Source: https://docs.chunkr.ai/pages/security/policies
The security and privacy of your data are foundational to our platform. We are committed to providing a secure environment for our customers, and this commitment is reflected in our architecture, policies, and the compliance certifications we maintain.
## The Lifecycle of Your Data
When you submit a file to Chunkr, it undergoes a carefully controlled lifecycle designed to maximize security and privacy:
1. **Secure Upload**: Your data is transmitted to our platform over encrypted TLS channels. Whether you upload a file directly, provide a URL, or send a base64-encoded string, your data is protected in transit.
2. **Ephemeral Storage & Zero Data Retention**: Upon receipt, your file is stored in our secure, access-controlled GCS (Google Cloud Storage) and Cloud SQL via GCP (Google Cloud Platform). For maximum security and privacy, you can configure a custom expiration time for each task. Once this period expires, all associated data - including original files, outputs, and any temporary assets - is permanently deleted from our servers. We keep minimal information for billing and auditing. For more details on how to configure this, see our guide on [Data Retention](/pages/task-system/task-handling#data-retention).
3. **Data Segregation**: We maintain strict logical separation of data between our customers. Your data is never commingled with that of other customers.
4. **Data Usage**: For customers on Scale tier or above, we will **never** use your data to train our models.
## Our Comprehensive Security Framework
To provide a multi-layered defense for our systems and your data, we have implemented a comprehensive set of security controls, policies, and procedures across all areas of our organization.
### Access & Authentication Control
* **Principle of Least Privilege**: Access to sensitive data and infrastructure is granted on a strict, need-to-know basis. We have a formal process for granting, reviewing, and revoking access rights.
* **Strong Authentication**: Multi-Factor Authentication (MFA) is mandatory for administrative access to all critical services, and we enforce strong password policies.
* **Regular Audits**: We maintain inventories of accounts and assets and conduct regular reviews of access permissions. Dormant accounts are promptly disabled.
### Data Protection & Encryption
* **End-to-End Encryption**: All customer data is encrypted in transit using strong TLS protocols and encrypted at rest using industry-standard AES-256 encryption.
* **Data Management**: We maintain a full data inventory and have established clear data management and retention policies to ensure your data is handled responsibly throughout its lifecycle.
* **Endpoint Security**: All end-user devices are equipped with anti-malware, firewalls, and full-disk encryption to protect data.
### Infrastructure & Network Security
* **Secure by Design**: Our infrastructure is deployed using Infrastructure-as-Code (IaC) principles, ensuring that our security configurations are version-controlled, auditable, and consistently applied.
* **Network Defenses**: We utilize a defense-in-depth strategy that includes Web Application Firewalls (WAF) and restrictive firewall rules to protect our public-facing infrastructure.
* **Continuous Monitoring**: Our infrastructure and network are continuously monitored for performance and security anomalies. We collect and analyze audit logs from all critical systems.
### Operational & Application Security
* **Secure Development**: We have established a secure software development lifecycle (SDLC), where all changes to our infrastructure and applications are logged and require peer review.
* **Vulnerability Management**: We conduct regular automated security scanning of our infrastructure and perform periodic penetration tests to identify and remediate vulnerabilities.
* **Disaster Recovery**: We have a robust business continuity and disaster recovery plan, which is tested regularly. Our backups are automated, isolated, and encrypted.
### People & Policy
* **Security Culture**: All employees undergo regular security awareness training and are bound by a code of conduct and confidentiality agreements.
* **Risk Management**: We perform regular risk assessments to proactively identify and mitigate potential threats to our platform and have a formal risk management policy in place.
* **Vendor Security**: We maintain a vendor management program to ensure that all third-party services meet our stringent security and compliance standards.
## Our Compliance Posture
We understand that our enterprise customers operate in regulated industries. That's why we've invested heavily in ensuring our platform meets the highest standards of security and compliance.
* **SOC 2 (Type I and Type II)**: We are currently undergoing both SOC 2 Type I and Type II audits, demonstrating our commitment to maintaining a secure and reliable platform. These audits, conducted by an independent third party, validate that our security controls are designed and operating effectively. To request a copy of our latest SOC 2 report, please contact our sales team.
* **HIPAA Compliance**: For our customers in the healthcare industry, we offer a HIPAA-compliant processing pipeline. We are prepared to sign a Business Associate Agreement (BAA) to ensure that any Protected Health Information (PHI) is handled in accordance with HIPAA's stringent security and privacy rules.
We are continuously working to improve our security posture and stay ahead of emerging threats. Our security program includes regular vulnerability scanning, penetration testing, and a dedicated security team to respond to any incidents.
Visit our [trust center](https://trust.chunkr.ai/) for more details. If you have any questions about our security practices or would like to discuss your specific security needs, please do not hesitate to contact us at [support@chunkr.ai](mailto:support@chunkr.ai).
### Our Subprocessors
To deliver our services, we partner with a select group of third-party vendors. Each subprocessor is vetted to ensure they meet our stringent security and privacy standards. The following table details these partners and the data they handle.
| Vendor | Service | Country |
| --------------------------- | ------------------------------ | ------------- |
| Amazon Web Services (AWS) | Cloud Infrastructure & Storage | United States |
| Microsoft Azure | Cloud Infrastructure | United States |
| Google Cloud Platform (GCP) | Cloud Infrastructure | United States |
| Cloudflare | Content Delivery Network | United States |
| OpenAI | AI Model Provider | United States |
| PostHog | Product Analytics | United States |
| SigNoz | Analytics & Monitoring | United States |
# Usage Limits
Source: https://docs.chunkr.ai/pages/task-system/limits
Task timeouts and file size restrictions.
Chunkr is designed for high-throughput processing.
Limits are minimal, but there are a few considerations to keep in mind.
## Task Timeout
All tasks have a **1-hour** timeout once processing begins. Tasks that exceed this timeout will automatically fail.
If you have a file that consistently times out, please [contact our support team](mailto:support@chunkr.ai) to discuss potential solutions.
***
## Rate Limiting
To ensure fair usage and system stability, we enforce a rate limit of **10 files per second**. If you exceed this limit, our API will respond with a `429 Too Many Requests` error.
Our SDKs are designed to handle this gracefully. They will automatically detect `429` errors and retry the request after a short backoff period.
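If you call the API without the SDKs, a similar retry-with-backoff pattern can be sketched as follows. This is a minimal illustration and not the SDKs' actual implementation: `RateLimitError`, `call_with_backoff`, and the delay values are hypothetical names and numbers chosen for the example.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 Too Many Requests response."""


def call_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call func(), retrying with exponential backoff and jitter on RateLimitError."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller
            # Delay doubles on each attempt (0.5s, 1s, 2s, ...) and is capped at max_delay
            delay = min(max_delay, base_delay * (2 ** attempt))
            # A little jitter spreads out retries from concurrent clients
            sleep(delay + random.uniform(0, delay / 10))
```

In practice, `func` would wrap the request that may be rate limited and raise `RateLimitError` when the response status is `429`.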
***
## File Size Limits
There are no hard limits on the size of the files you can upload. However, there is one exception:
* **Base64 Uploads**: When uploading a file as a base64-encoded string, the total request size is limited to **1GB**. This is a practical limit to ensure reliable transfer over HTTP.
If you need to upload files larger than 1GB, we recommend using a URL.
```python Python theme={"system"}
from chunkr_ai import Chunkr
import os
client = Chunkr(api_key=os.environ['CHUNKR_API_KEY'])
# Recommended for large files - use URL directly
task = client.tasks.parse.create(file='https://chunkr.ai/very-large-file.pdf')
```
```typescript TypeScript theme={"system"}
import Chunkr from "chunkr-ai";
const client = new Chunkr({ apiKey: process.env.CHUNKR_API_KEY! });
// Recommended for large files - use URL directly
const task = await client.tasks.parse.create({
  file: "https://chunkr.ai/very-large-file.pdf",
});
```
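Because base64 encoding turns every 3 bytes into 4 characters, a file reaches the request limit well before its raw size does. The sketch below estimates the encoded size so you can decide between a base64 upload and a URL. It rests on assumptions: "1GB" is read as 10**9 bytes (the limit above does not say whether it is decimal or binary), the `overhead` allowance for the rest of the request is a guess, and the helper names are hypothetical.

```python
import os

# Assumption: "1GB" is interpreted as 10**9 bytes
BASE64_REQUEST_LIMIT_BYTES = 10**9


def base64_encoded_size(raw_size: int) -> int:
    """Length of the padded base64 encoding of raw_size bytes."""
    return 4 * ((raw_size + 2) // 3)


def fits_base64_limit(raw_size: int, overhead: int = 1024) -> bool:
    """True if the encoded payload plus a rough allowance for the rest of the request fits."""
    return base64_encoded_size(raw_size) + overhead <= BASE64_REQUEST_LIMIT_BYTES


def should_use_url(path: str) -> bool:
    """Suggest the URL route when a local file is too large to send as base64."""
    return not fits_base64_limit(os.path.getsize(path))
```

Under these assumptions, files above roughly 750 MB no longer fit once encoded.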
***
## Page Limits
There is a soft limit of **2,000 pages** per single file. We accept larger files, but they may fail to process.
For documents exceeding this limit, please split the file into multiple parts before uploading.
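As a starting point for splitting, the sketch below only computes which page ranges each part should contain; the actual splitting would be done with a PDF library of your choice. `split_page_ranges` is a hypothetical helper, and the 2,000-page default simply mirrors the soft limit above.

```python
def split_page_ranges(total_pages: int, max_pages: int = 2000) -> list[tuple[int, int]]:
    """Return 1-indexed, inclusive (start, end) page ranges of at most max_pages each."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


print(split_page_ranges(4500))  # [(1, 2000), (2001, 4000), (4001, 4500)]
```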
# Task System Overview
Source: https://docs.chunkr.ai/pages/task-system/overview
Understanding Chunkr's task-based processing system
All processing in Chunkr is handled through a task-based system. When you submit a file, a new task is created, and you receive a `task_id`. Use this ID to retrieve the task's status and results at any time.
This asynchronous approach allows you to submit long-running jobs without tying up your application. Once the task status is `Succeeded`, the full processing results are available.
## Key Features
* **Scalability**: Handle millions of files without tying up your infrastructure.
* **[Broad File Support](/pages/get-started/file-types)**: Process a variety of file types, such as PDF, Excel, PowerPoint, and Word documents.
* **[Multiple Input Sources](/pages/task-system/task-handling#supported-input-sources)**: Provide files from a local path, from a URL, or as a base64-encoded string.
* **[Data Retention](/pages/task-system/task-handling#data-retention)**: Set custom expiration times for automatic data deletion.
* **[Webhook Support](/pages/task-system/webhooks/overview)**: Receive real-time notifications when tasks complete.
### Example: Upload a file, create a task, and get results
Here's how to upload a file, create a parse task, and retrieve the results.
```python Python theme={"system"}
import os
import time
from chunkr_ai import Chunkr
# Initialize the client
client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])
# 1. Upload a local file
with open("path/to/doc.pdf", "rb") as f:
    uploaded = client.files.create(file=f)
# 2. Create a parse task using the uploaded file URL
parse_task = client.tasks.parse.create(file=uploaded.url)
print(f"Task created with ID: {parse_task.task_id}")
# 3. Wait for the task to complete
while not parse_task.completed:
    print(f"Task status: {parse_task.status}")
    time.sleep(3)
    parse_task = client.tasks.parse.get(task_id=parse_task.task_id)
# 4. Access the results
if parse_task.status == "Succeeded" and parse_task.output is not None:
    print("Task completed successfully!")
    print(f"Document has {len(parse_task.output.chunks)} chunks")
else:  # Could be "Failed" or "Cancelled"
    print(f"Task status: {parse_task.status}")
```
```typescript TypeScript theme={"system"}
import Chunkr from "chunkr-ai";
import fs from "fs";
// Initialize the client
const client = new Chunkr({ apiKey: process.env.CHUNKR_API_KEY! });
// 1. Upload a local file
const fileStream = fs.createReadStream("path/to/doc.pdf");
const uploaded = await client.files.create({
  file: fileStream,
  file_metadata: JSON.stringify({ name: "doc.pdf", type: "application/pdf" }),
});
// 2. Create a parse task using the uploaded file URL
let task = await client.tasks.parse.create({ file: uploaded.url });
console.log(`Task created with ID: ${task.task_id}`);
// 3. Wait for the task to complete
while (!task.completed) {
  task = await client.tasks.parse.get(task.task_id);
  console.log(`Task status: ${task.status}`);
  await new Promise((resolve) => setTimeout(resolve, 3000));
}
// 4. Access the results
if (task.status === "Succeeded") {
  console.log("Task completed successfully!");
  console.log(`Document has ${task.output?.chunks.length} chunks`);
} else { // Could be "Failed" or "Cancelled"
  console.log(`Task status: ${task.status}`);
}
```
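The polling loops above run until the task completes. A reusable variant with an overall deadline might look like the sketch below. `wait_for_task` is a hypothetical helper, not part of the SDK; it only assumes that the fetched task object exposes the `completed` attribute used in the examples, and the 3600-second default mirrors the 1-hour task timeout.

```python
import time


def wait_for_task(fetch, task_id, interval=3.0, timeout=3600.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll fetch(task_id) until the task reports completed, or raise TimeoutError."""
    deadline = clock() + timeout
    while True:
        task = fetch(task_id)
        if task.completed:
            return task
        if clock() >= deadline:
            raise TimeoutError(f"Task {task_id} did not complete within {timeout} seconds")
        sleep(interval)
```

With the client from the example, usage would be along the lines of `wait_for_task(lambda tid: client.tasks.parse.get(task_id=tid), parse_task.task_id)`.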
# Basic Task Handling
Source: https://docs.chunkr.ai/pages/task-system/task-handling
Create tasks, check their status, and retrieve processed results
## Create Task (upload-first)
Most workflows start by uploading a local file, then creating a task using the uploaded file URL.
* `client.files.create()`: Uploads a local file and returns a URL.
* `client.tasks.parse.create()`: Submits the uploaded file URL for processing.
```python Python theme={"system"}
import os
from chunkr_ai import Chunkr
client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])
# Upload a local file first
with open("path/to/doc.pdf", "rb") as f:
    uploaded_file = client.files.create(file=f)
# Create the task with the uploaded file URL
task = client.tasks.parse.create(file=uploaded_file.url)
print(f"Task created with ID: {task.task_id}")
print(f"Initial status: {task.status}") # "Starting" or "Processing"
```
```typescript TypeScript theme={"system"}
import Chunkr from "chunkr-ai";
import fs from "fs";
const client = new Chunkr({ apiKey: process.env.CHUNKR_API_KEY! });
// Upload a local file first
const fileStream = fs.createReadStream("path/to/doc.pdf");
const uploadedFile = await client.files.create({
  file: fileStream,
  file_metadata: JSON.stringify({ name: "doc.pdf", type: "application/pdf" }),
});
// Create the task with the uploaded file URL
const task = await client.tasks.parse.create({ file: uploadedFile.url });
console.log(`Task created with ID: ${task.task_id}`);
console.log(`Initial status: ${task.status}`); // "Starting" or "Processing"
```
### Supported Input Sources
You can provide a file via a URL, a local file (upload-first), or a base64-encoded string.
```python Python theme={"system"}
import base64
import os
from chunkr_ai import Chunkr
client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])
# From a URL (if available)
task = client.tasks.parse.create(
    file="https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/doc.pdf"
)
# Or, from a local file
with open("path/to/doc.pdf", "rb") as f:
    uploaded_file = client.files.create(file=f)
task = client.tasks.parse.create(file=uploaded_file.url)
# OR from a base64 string
with open("path/to/doc.pdf", "rb") as f:
    base64_string = base64.b64encode(f.read()).decode("utf-8")
task = client.tasks.parse.create(
    file=f"data:application/pdf;base64,{base64_string}"
)
```
```typescript TypeScript theme={"system"}
import Chunkr from "chunkr-ai";
import fs from "fs";
const client = new Chunkr({ apiKey: process.env.CHUNKR_API_KEY! });
// From a URL (if available)
let task = await client.tasks.parse.create({
  file: "https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/doc.pdf",
});
// Or, from a local file
const fileStream = fs.createReadStream("path/to/doc.pdf");
const uploadedFile = await client.files.create({
  file: fileStream,
  file_metadata: JSON.stringify({
    name: "doc.pdf",
    type: "application/pdf",
  }),
});
task = await client.tasks.parse.create({
  file: uploadedFile.url,
});
// OR from a File object (browser)
const file = new File(["file contents"], "path/to/doc.pdf");
const uploaded = await client.files.create({
  file,
  file_metadata: JSON.stringify({ name: "doc.pdf" }),
});
task = await client.tasks.parse.create({
  file: uploaded.url,
});
```
### Configuration
Most users can start without any configuration. If needed, you can set optional parameters like `expires_in` for data retention when creating a task. For advanced options, see [API Reference](/api-references/tasks/create-parse-task#response-configuration).
***
## Get Task
Retrieve information for any task using its `task_id`. There are several ways to get task results depending on your needs.
### Get Completed Task
For tasks that have already completed processing, you can retrieve the results immediately:
```python Python theme={"system"}
import os
from chunkr_ai import Chunkr
client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])
# Get the task
task = client.tasks.parse.get(task_id="task_123")
# Access task info
print(f"Status: {task.status}")
if task.status == "Succeeded" and task.output is not None:
    print(f"Chunks: {len(task.output.chunks)}")
    for chunk in task.output.chunks[:5]:
        if chunk.content is not None:
            print(f"- {chunk.content[:100]}...")
```
```typescript TypeScript theme={"system"}
import Chunkr from "chunkr-ai";
const client = new Chunkr({ apiKey: process.env.CHUNKR_API_KEY! });
// Get the task
const task = await client.tasks.parse.get("task_123");
// Access task info
console.log(`Status: ${task.status}`);
if (task.status === "Succeeded") {
  console.log(`Chunks: ${task.output?.chunks.length}`);
  for (const chunk of task.output?.chunks.slice(0, 5) ?? []) {
    console.log(`- ${chunk.content?.slice(0, 100)}...`);
  }
}
```
### Robust Polling with Retry Logic
For tasks still processing, implement polling with retry logic using dedicated retry libraries for better error handling and exponential backoff.
We recommend using [tenacity](https://github.com/jd/tenacity) for Python and [p-retry](https://github.com/sindresorhus/p-retry) for TypeScript.
```python Python theme={"system"}
import os
from chunkr_ai import Chunkr
from tenacity import retry, retry_if_result, stop_after_attempt, wait_fixed
client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])
@retry(
    retry=retry_if_result(lambda result: not result.completed),
    stop=stop_after_attempt(1500),
    wait=wait_fixed(3),
)
def get_task(task_id):
    task = client.tasks.parse.get(task_id=task_id)
    print(f"Task ID: {task_id}, Status: {task.status}")
    return task
# Get task with polling
task = get_task("task_123")
print(task.status)  # "Succeeded", "Failed", or "Cancelled"
if task.status == "Succeeded" and task.output is not None:
    print(f"Found {len(task.output.chunks)} chunks")
```
```typescript TypeScript theme={"system"}
import { Chunkr } from "chunkr-ai";
import pRetry from "p-retry";
const client = new Chunkr({ apiKey: process.env.CHUNKR_API_KEY! });
async function getTask(taskId: string) {
  return await pRetry(
    async () => {
      const task = await client.tasks.parse.get(taskId, {
        include_chunks: true,
      });
      console.log(`Task ID: ${taskId}, Status: ${task.status}`);
      if (!task.completed) {
        throw new Error(
          `Task not completed yet. Current status: ${task.status}`
        );
      }
      return task;
    },
    {
      retries: 1500,
      minTimeout: 3000,
      maxTimeout: 3000,
      onFailedAttempt: () => {},
    }
  );
}
// Get task with polling
const task = await getTask("task_123");
console.log(task.status);
if (task.status === "Succeeded") {
  console.log(`Found ${task.output?.chunks?.length || 0} chunks`);
}
```
We recommend a large number of retries so that polling continues until the task reaches a terminal state instead of giving up while it is still processing.
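One way to size the retry count is to divide the longest time you are willing to wait by the polling interval. The sketch below does that arithmetic with the 1-hour task timeout as the target; `attempts_needed` is a hypothetical helper and the numbers are illustrative.

```python
import math


def attempts_needed(max_wait_seconds: float, interval_seconds: float) -> int:
    """Number of polling attempts whose waits in between cover max_wait_seconds."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # N attempts are separated by N - 1 waits, so add one for the first attempt
    return math.ceil(max_wait_seconds / interval_seconds) + 1


print(attempts_needed(3600, 3))  # 1201
print(1500 * 3 / 60)             # 75.0 (roughly 75 minutes of polling with the settings above)
```

By this estimate, 1,500 attempts at 3-second intervals leave some headroom over the 1-hour timeout.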
### Get Task with Base64-Encoded Assets
By default, Chunkr provides access to generated files (like images or PDF crops) via temporary pre-signed URLs that expire after 10 minutes. For long-term access, you can retrieve file assets as base64-encoded strings, which embeds the data directly in the task response.
Set `base64_urls=True` when fetching a task to get base64-encoded strings:
```python Python theme={"system"}
import os
from chunkr_ai import Chunkr
client = Chunkr(api_key=os.environ["CHUNKR_API_KEY"])
# Set base64_urls=True
# Assets are now embedded as base64 strings and won't expire
task = client.tasks.parse.get(task_id="task_123", base64_urls=True)
```
```typescript TypeScript theme={"system"}
import Chunkr from "chunkr-ai";
const client = new Chunkr({ apiKey: process.env.CHUNKR_API_KEY! });
// Set base64_urls: true
// Assets are now embedded as base64 strings and won't expire
const task = await client.tasks.parse.get("task_123", { base64_urls: true });
```
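Once an asset arrives as a base64 string, it can be decoded back into bytes and written to disk. The following is a minimal standard-library sketch: the helper names `decode_asset` and `save_asset` are hypothetical, and the handling of an optional `data:` URI prefix is an assumption, since the exact shape of the returned string is not described here.

```python
import base64
from pathlib import Path


def decode_asset(encoded: str) -> bytes:
    """Decode a base64 asset string into raw bytes.

    Assumption: the string may carry a data-URI prefix such as
    "data:image/png;base64,". If present, it is stripped first.
    """
    if encoded.startswith("data:") and "," in encoded:
        encoded = encoded.split(",", 1)[1]
    return base64.b64decode(encoded)


def save_asset(encoded: str, path: str) -> int:
    """Write the decoded asset to `path` and return the number of bytes written."""
    data = decode_asset(encoded)
    Path(path).write_bytes(data)
    return len(data)
```

Storing the decoded bytes in your own storage avoids depending on the expiring pre-signed URLs.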
***
## Asynchronous Processing (Python)
For Python applications that require non-blocking operations, you can use the `AsyncChunkr` client instead of `Chunkr`.
The async client provides the exact same methods and parameters, but all operations are awaitable.
```python Python theme={"system"}
import asyncio
import os
from chunkr_ai import AsyncChunkr
from tenacity import retry, retry_if_result, stop_after_attempt, wait_fixed

@retry(
    retry=retry_if_result(lambda result: not result.completed),
    stop=stop_after_attempt(25),
    wait=wait_fixed(3),
)
async def get_task(client: AsyncChunkr, task_id: str):
    return await client.tasks.parse.get(task_id=task_id)


async def process_document():
    client = AsyncChunkr(api_key=os.environ["CHUNKR_API_KEY"])

    # Create task
    task = await client.tasks.parse.create(
        file="https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/doc.pdf"
    )
    print(f"Task created with ID: {task.task_id}")

    # Get results
    task = await get_task(client, task.task_id)
    print(task.status)
    if task.status == "Succeeded" and task.output is not None:
        print(f"Processed {len(task.output.chunks)} chunks")


# Run with asyncio
asyncio.run(process_document())
```
**Key points about async processing:**
* Import `AsyncChunkr` instead of `Chunkr`
* Use `await` before all client method calls
* All method names and parameters remain exactly the same
* Perfect for applications already using `asyncio` or handling multiple concurrent operations
This means you don't need to learn a different API - just switch the client class and add `await` to your calls.
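For the "multiple concurrent operations" case, one common pattern is to cap how many requests are in flight at once. The sketch below uses only `asyncio`; the helper `run_limited` and its `max_concurrency` parameter are hypothetical and not part of the Chunkr SDK. The `worker` argument is any coroutine function you supply.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")


async def run_limited(
    items: Iterable[str],
    worker: Callable[[str], Awaitable[T]],
    max_concurrency: int = 5,
) -> List[T]:
    """Run `worker` over `items` with at most `max_concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: str) -> T:
        # Wait for a free slot before starting the call
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(item) for item in items))
```

With the async client from the example above, `worker` could be something like `lambda url: client.tasks.parse.create(file=url)`.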
***
## Data Retention
While we store all outputs, original files, and image crops, you can use Chunkr solely as a processing engine.
For security and privacy, use the `expires_in` parameter to automatically delete all task data from Chunkr's servers after processing.
Here's an example config that sets the data to expire in 24 hours for Zero Data Retention. You would then use the get methods described above to retrieve your results before the data expires:
```python Python theme={"system"}
from chunkr_ai import Chunkr
client = Chunkr()
# Set expires_in for Zero Data Retention (ZDR)
task = client.tasks.parse.create(
    file='https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/doc.pdf',
    expires_in=24 * 60 * 60,  # After 24 hours
)
```
```typescript TypeScript theme={"system"}
import Chunkr from "chunkr-ai";
const client = new Chunkr();
// Set expires_in for Zero Data Retention (ZDR)
const task = await client.tasks.parse.create({
  file: "https://s3.us-east-1.amazonaws.com/chunkr-web/uploads/doc.pdf",
  expires_in: 24 * 60 * 60, // After 24 hours
});
```
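Because results must be fetched before the data expires, it can help to record a local deadline when you create the task. This is a small sketch with hypothetical helper names. It assumes `expires_in` is a number of seconds, as the examples above suggest, and that the countdown starts roughly when the task is created, which is not confirmed here.

```python
from datetime import datetime, timedelta, timezone


def expires_in_seconds(lifetime: timedelta) -> int:
    """Convert a timedelta into a whole number of seconds."""
    return int(lifetime.total_seconds())


def fetch_deadline(created_at: datetime, lifetime: timedelta) -> datetime:
    """Estimate the latest time by which results should be retrieved."""
    return created_at + lifetime


lifetime = timedelta(hours=24)
created_at = datetime.now(timezone.utc)

print(expires_in_seconds(lifetime))  # 86400
print(fetch_deadline(created_at, lifetime).isoformat())
```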
***
## Advanced Features
While creating and reading tasks are the most common operations, Chunkr also provides functionality for more advanced task management:
* **List Tasks**: View all your tasks with pagination, filtering, and sorting options.
* **Delete Tasks**: Permanently remove completed or failed tasks to clean up your workspace.
* **Cancel Tasks**: Stop a queued task before it begins processing if it's no longer needed.
For detailed information on these operations, see the [API references](/api-references/tasks/create-task).
# Example
Source: https://docs.chunkr.ai/pages/task-system/webhooks/examples
Quick, copy‑pasteable snippets for receiving Chunkr webhooks, verifying them with Svix, and then fetching the related task via the Chunkr SDK.
See the end‑to‑end flow in the Webhooks overview.
## Installation
Install the required packages:
```bash Python theme={"system"}
pip install chunkr-ai fastapi uvicorn svix
```
```bash TypeScript theme={"system"}
bun add chunkr-ai svix
```
## Python (FastAPI)
This minimal FastAPI handler verifies the webhook signature using Svix and then calls `chunkr.tasks.parse.get(task_id)`.
```python theme={"system"}
from fastapi import FastAPI, Request, Response, status
from svix.webhooks import Webhook, WebhookVerificationError
from chunkr_ai import Chunkr
import os

app = FastAPI()

CHUNKR_WEBHOOK_SECRET = (
    os.environ.get('CHUNKR_WEBHOOK_SECRET') or ''
)  # starts with "whsec_"

chunkr = Chunkr()


@app.post('/chunkr/webhooks', status_code=status.HTTP_204_NO_CONTENT)
async def webhook_handler(request: Request) -> Response:
    headers = request.headers
    payload = await request.body()  # Verify against the raw body

    try:
        svix_id = headers.get('svix-id')
        svix_timestamp = headers.get('svix-timestamp')
        svix_signature = headers.get('svix-signature')
        if not svix_id or not svix_timestamp or not svix_signature:
            return Response(status_code=status.HTTP_400_BAD_REQUEST)

        svix_headers = {
            'svix-id': svix_id,
            'svix-timestamp': svix_timestamp,
            'svix-signature': svix_signature,
        }
        msg = Webhook(CHUNKR_WEBHOOK_SECRET).verify(payload, svix_headers)  # dict
    except WebhookVerificationError:
        return Response(status_code=status.HTTP_400_BAD_REQUEST)

    task_id = msg.get('task_id')
    status_str = msg.get('status')

    # Fetch only once the task is completed
    if task_id and status_str == 'Succeeded':
        _task = chunkr.tasks.parse.get(task_id)
        if _task.output is not None:
            # Do something with _task (e.g., persist results)
            for chunk in _task.output.chunks:
                print(chunk.content)

    # No content response for webhook receivers
    return Response(status_code=status.HTTP_204_NO_CONTENT)
```
Run the server locally:
```bash theme={"system"}
uvicorn main:app --reload --port 8000
```
## TypeScript (Bun)
This minimal Bun server verifies the webhook signature using Svix and then calls `chunkr.tasks.parse.get(task_id)`.
```typescript theme={"system"}
import Chunkr from "chunkr-ai";
import { Webhook } from "svix";
const CHUNKR_WEBHOOK_SECRET = process.env.CHUNKR_WEBHOOK_SECRET || ""; // starts with "whsec_"
const chunkr = new Chunkr();
// Webhook Endpoint
async function handleWebhook(request: Request): Promise<Response> {
  try {
    const headers = request.headers;
    const payload = await request.text();

    // Extract Svix headers
    const svixId = headers.get("svix-id");
    const svixTimestamp = headers.get("svix-timestamp");
    const svixSignature = headers.get("svix-signature");
    if (!svixId || !svixTimestamp || !svixSignature) {
      return new Response("Missing required headers", { status: 400 });
    }

    const svixHeaders = {
      "svix-id": svixId,
      "svix-timestamp": svixTimestamp,
      "svix-signature": svixSignature,
    };

    // Verify the webhook
    const wh = new Webhook(CHUNKR_WEBHOOK_SECRET);
    const msg = wh.verify(payload, svixHeaders) as any;
    const taskId = msg.task_id;
    const status = msg.status;

    // Fetch only once the task is completed
    if (taskId && status === "Succeeded") {
      const task = await chunkr.tasks.parse.get(taskId);
      if (task.output) {
        // Do something with task (e.g., persist results)
        const parseOutput = task.output;
        if (parseOutput.chunks) {
          for (const chunk of parseOutput.chunks) {
            console.log(chunk.content);
          }
        }
      }
    }

    // No content response for webhook receivers
    return new Response(null, { status: 204 });
  } catch {
    return new Response("Bad Request", { status: 400 });
  }
}

function startServer() {
  const server = Bun.serve({
    port: process.env.PORT || 8000,
    async fetch(request) {
      const url = new URL(request.url);
      if (url.pathname === "/chunkr/webhooks" && request.method === "POST") {
        return handleWebhook(request);
      }
      return new Response("Not Found", { status: 404 });
    },
  });
  return server;
}

if (import.meta.main) {
  startServer();
}
```
Run the server locally:
```bash theme={"system"}
bun run main.ts
```
# Overview
Source: https://docs.chunkr.ai/pages/task-system/webhooks/overview
## What are webhooks?
You can set up webhooks to receive real-time notifications from Chunkr whenever task events occur in your account. They are ideal for bulk uploads and other high-volume workflows.
Webhooks are HTTPS POST requests sent to an endpoint you control. You can add and manage endpoints from the Chunkr dashboard.
You can use a single endpoint per service that listens to all event types, or restrict it to only the events you care about.
For example, you might receive webhooks from Chunkr at a URL like `https://www.example.com/chunkr/webhooks/`, where `www.example.com` stands in for your own domain.
Indicate that a webhook has been processed by returning a 2xx (status code 200-299) response to the webhook message within a reasonable time frame (15 seconds).
It's also important to disable CSRF protection for this endpoint if your framework enables it by default.
Another important aspect of handling webhooks is verifying the signature and timestamp when processing them.
You can learn more about this in the signature verification section.
## Events & payloads
Chunkr currently emits only one event type:
* `task.parse.updated`: Webhook event payload for parse task updates
Example payload shape:
```javascript Task Parse Updated theme={"system"}
{
"event_type": "task.parse.updated",
"message": null,
"status": "Starting",
"task_id": "..."
}
```
You can see more information in the [API references](/api-references/task-parse-updated).
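In a receiver, it can be convenient to turn the payload into a typed object before acting on it. The sketch below uses only the fields shown in the example payload; the `TaskParseUpdated` class and `parse_event` function are hypothetical names, and the payload should only be parsed after its signature has been verified.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskParseUpdated:
    """Fields taken from the example payload above."""
    event_type: str
    status: str
    task_id: str
    message: Optional[str] = None


def parse_event(raw_body: bytes) -> TaskParseUpdated:
    """Parse an already-verified webhook body into a TaskParseUpdated."""
    data = json.loads(raw_body)
    if data.get("event_type") != "task.parse.updated":
        raise ValueError(f"unexpected event type: {data.get('event_type')!r}")
    return TaskParseUpdated(
        event_type=data["event_type"],
        status=data["status"],
        task_id=data["task_id"],
        message=data.get("message"),
    )
```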
## Add an endpoint
To start listening to messages, you need to configure your endpoints.
Adding an endpoint is as simple as providing a URL that you control and selecting the event types that you want to listen to.
If you don't specify any event types, by default, your endpoint will receive all events, regardless of type.
This can be helpful for getting started and for testing, but we recommend changing this to a subset later on to avoid receiving extraneous messages.
If your endpoint isn't quite ready to start receiving events, you can click the "Svix Play" button to have a unique URL generated for you.
You'll be able to view and inspect webhooks sent to your Svix Play URL, making it effortless to get started.
To add your first endpoint, click the Webhooks section on the [Chunkr dashboard](https://www.chunkr.ai/dashboard).
## Test an endpoint
Once you've added an endpoint, you'll want to make sure it's working.
The "Testing" tab lets you send test events to your endpoint.
After sending an example event, you can click into the message to view the message payload, all of the message attempts, and whether it succeeded or failed.
### Local testing with Svix Play (CLI)
For local development, you can use [Svix Play](https://docs.svix.com/play) to relay a publicly accessible URL to your local server. Point your Chunkr webhook endpoint to the Play URL, and the CLI will forward requests to your localhost for easy debugging.
1. Install the CLI:
```bash macOS theme={"system"}
brew install svix/svix/svix
```
```bash Linux theme={"system"}
snap install svix
```
```bash Windows theme={"system"}
scoop bucket add svix https://github.com/svix/scoop-svix.git
scoop install svix
```
2. Start a relay to your local webhook handler:
```bash Start relay theme={"system"}
svix listen http://localhost:8000/chunkr/webhooks
# -> Prints: https://play.svix.com/in// and a view URL
```
3. In the Chunkr dashboard, set your webhook endpoint URL to the printed Play URL. All messages sent to this URL will be forwarded to your local server.
4. Watch requests in the Svix Play UI (the CLI prints a "view" link) and in your local server logs.
## Verifying webhooks
It’s important to know whether a webhook really came from `chunkr.ai` or from a third party trying to exploit your endpoint.
To let you tell the difference, we send a signature in the headers of every webhook, which you can verify using your endpoint's signing secret.
To verify the signatures, see Svix's guide on how to verify webhooks with the [Svix libraries](https://docs.svix.com/receiving/verifying-payloads/how) or how to verify [webhooks manually](https://docs.svix.com/receiving/verifying-payloads/how-manual).
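For readers who want to see what manual verification involves, here is a standard-library sketch that follows the scheme in Svix's manual verification guide: HMAC-SHA256 over `id.timestamp.body`, keyed with the base64-decoded part of the `whsec_` secret. The function names and the 5-minute timestamp tolerance are assumptions for illustration. In production, prefer the Svix libraries.

```python
import base64
import hashlib
import hmac
import time
from typing import Mapping, Optional


def sign(secret: str, msg_id: str, timestamp: str, body: bytes) -> str:
    """Compute the base64 signature for a message."""
    encoded_key = secret[len("whsec_"):] if secret.startswith("whsec_") else secret
    key = base64.b64decode(encoded_key)
    signed_content = f"{msg_id}.{timestamp}.".encode() + body
    digest = hmac.new(key, signed_content, hashlib.sha256).digest()
    return base64.b64encode(digest).decode()


def verify(
    secret: str,
    headers: Mapping[str, str],
    body: bytes,
    tolerance_seconds: int = 300,  # assumed tolerance for illustration
    now: Optional[float] = None,
) -> bool:
    """Return True if the raw body matches one of the v1 signatures."""
    try:
        msg_id = headers["svix-id"]
        timestamp = headers["svix-timestamp"]
        signatures = headers["svix-signature"]
        sent_at = int(timestamp)
    except (KeyError, ValueError):
        return False

    # Reject stale or future-dated messages to limit replay attacks
    current = time.time() if now is None else now
    if abs(current - sent_at) > tolerance_seconds:
        return False

    expected = sign(secret, msg_id, timestamp, body).encode()
    # The header holds space-separated entries such as "v1,<base64 signature>"
    for candidate in signatures.split(" "):
        version, _, value = candidate.partition(",")
        if version == "v1" and hmac.compare_digest(value.encode(), expected):
            return True
    return False
```

Note that `body` must be the raw request bytes, not a re-serialized copy of the parsed JSON.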
## Retries
We attempt to deliver each webhook message based on a retry schedule with exponential backoff.
### The schedule
Each message is attempted on the following schedule. Each period starts after the preceding attempt fails:
* Immediately
* 5 seconds
* 5 minutes
* 30 minutes
* 2 hours
* 5 hours
* 10 hours
* 10 hours (in addition to the previous)
For example, a message that fails three times before eventually succeeding will be delivered roughly 35 minutes and 5 seconds after the first attempt (5 seconds + 5 minutes + 30 minutes).
If an endpoint is removed or disabled, delivery attempts to that endpoint stop as well.
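The cumulative delays implied by the schedule can be tallied with a few lines of Python. This is only arithmetic over the listed intervals; it ignores the time each attempt itself takes, and the names `RETRY_DELAYS` and `time_of_attempt` are illustrative.

```python
from datetime import timedelta

# Delay before each attempt, in the order listed in the schedule
RETRY_DELAYS = [
    timedelta(0),            # immediately
    timedelta(seconds=5),
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=5),
    timedelta(hours=10),
    timedelta(hours=10),
]


def time_of_attempt(n: int) -> timedelta:
    """Elapsed time from the first attempt to attempt number n (1-based)."""
    if not 1 <= n <= len(RETRY_DELAYS):
        raise ValueError("attempt number out of range")
    return sum(RETRY_DELAYS[:n], timedelta(0))


print(time_of_attempt(4))                  # 0:35:05
print(time_of_attempt(len(RETRY_DELAYS)))  # 1 day, 3:35:05
```

The last attempt therefore happens a little over 27 hours after the first one.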
### Manual retries
You can also use the application portal to manually retry each message at any time, or automatically retry ("Recover") all failed messages starting from a given date.
# Troubleshooting
Source: https://docs.chunkr.ai/pages/task-system/webhooks/troubleshooting
## Troubleshooting Tips
**Not using the raw payload body**
This is the most common issue. When generating the signed content, we use the raw string body of the message payload.
If you convert JSON payloads into strings using methods like `JSON.stringify`, different implementations may produce different string representations of the JSON object, which can lead to discrepancies when verifying the signature. It's crucial to verify the payload exactly as it was sent, byte-for-byte or string-for-string, to ensure accurate verification.
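A short demonstration of why this matters: parsing a body and serializing it again can change the bytes, which changes any signature computed over them. The key and payload below are made up for illustration.

```python
import hashlib
import hmac
import json

raw_body = b'{"task_id":"abc","status":"Succeeded"}'

# Round-tripping through a JSON parser changes whitespace here
reserialized = json.dumps(json.loads(raw_body)).encode()

key = b"illustrative-key"  # not a real signing secret


def mac(data: bytes) -> str:
    return hmac.new(key, data, hashlib.sha256).hexdigest()


print(reserialized)                         # b'{"task_id": "abc", "status": "Succeeded"}'
print(raw_body == reserialized)             # False
print(mac(raw_body) == mac(reserialized))   # False
```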
**Using the wrong secret key**
From time to time, we see people simply using the wrong secret key. Signing secrets are unique to each endpoint, so make sure you are using the secret for the endpoint that received the message.
**Sending the wrong response codes**
When we receive a response with a 2xx status code, we interpret that as a successful delivery even if you indicate a failure in the response payload. Make sure to use the correct response status codes so we know when messages are supposed to succeed or fail.
**Responses timing out**
Any message that does not receive a response within the configured timeout is considered failed. If your endpoint runs complicated workflows before responding, it may time out and cause failed deliveries.
We suggest having your endpoint simply receive the message and add it to a queue to be processed asynchronously so you can respond promptly and avoid timing out.
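The receive-then-queue pattern can be sketched with the standard library alone. Here `handle_webhook` stands in for your framework's request handler and `process` for your real workload; both names are hypothetical, and signature verification is assumed to have happened before the body is enqueued.

```python
import queue
import threading
from typing import List

events: "queue.Queue[bytes]" = queue.Queue()
processed: List[bytes] = []


def process(body: bytes) -> None:
    """Placeholder for slow work (fetching the task, persisting results)."""
    processed.append(body)


def worker() -> None:
    """Drain the queue in the background, one message at a time."""
    while True:
        body = events.get()
        try:
            process(body)
        finally:
            events.task_done()


def handle_webhook(raw_body: bytes) -> int:
    """Enqueue the verified body and acknowledge immediately with a 2xx status."""
    events.put(raw_body)
    return 204


# Daemon thread so it does not block interpreter shutdown
threading.Thread(target=worker, daemon=True).start()
```

An in-process queue loses messages if the process restarts, so a durable queue is usually a better fit for production workloads.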
## Failure Recovery
**Re-enable a disabled endpoint**
If all attempts to a specific endpoint fail for a period of 5 days, the endpoint will be disabled. To re-enable a disabled endpoint, go to the webhook dashboard, find the endpoint from the list and select "Enable Endpoint".
**Recovering and resending failed messages**
If your service has downtime or if your endpoint was misconfigured, you probably want to recover any messages that failed during the downtime.
If you want to replay a single event, you can find the message in the UI and click the options menu next to the attempt.
From there, click "Resend" to have the same message sent to your endpoint again.
If you need to recover from a service outage and want to replay all the events since a given time, you can do so from the endpoint page. On an endpoint's details page, click "Options > Recover Failed Messages".
From there, you can choose a time window to recover from.
For a more granular recovery - for example, if you know the exact timestamp that you want to recover from - you can click the options menu on any message from the endpoint page.
From there, you can click "Replay..." and choose to "Replay all failed messages since this time."