
High-Level Structure

Extract returns a Task object. When processing succeeds, the output field contains three objects: results, citations, and metrics. The results object holds your extracted data in the shape of your custom JSON schema. The citations and metrics objects mirror that shape and can be addressed using field paths.
  • results: Your exact JSON schema structure, filled with extracted data.
  • citations: Mirrors the results shape. At every leaf field (a primitive value) you will find an array of citation objects. For arrays of primitives, citations is an array where each element holds the citation array for that index or null when no citations were created for that element.
  • metrics: Mirrors the results shape. At every leaf field, you will find a metrics object. For arrays of primitives, metrics is an array where each element holds the metrics object for that index.
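As a minimal sketch of how the three objects line up, the Python snippet below unpacks a task payload and pairs each element of an array of primitives with its citation count and confidence. It assumes the Task has already been deserialized into a plain dict; unpack_output and the trimmed task literal are illustrative names and data, not part of any SDK.

```python
from typing import Any

def unpack_output(task: dict[str, Any]) -> tuple[Any, Any, Any]:
    """Return (results, citations, metrics) from a task dict (hypothetical helper)."""
    # "Succeeded" is the status value shown in the sample response.
    if task.get("status") != "Succeeded":
        raise ValueError(f"task not successful: {task.get('status')!r}")
    output = task["output"]
    return output["results"], output["citations"], output["metrics"]

# Trimmed, illustrative payload: the second tag has no citations (null).
task = {
    "status": "Succeeded",
    "output": {
        "results": {"tags": ["Overdue", "International"]},
        "citations": {
            "tags": [
                [{"citation_type": "Segment", "content": "Overdue", "page_number": 1}],
                None,
            ]
        },
        "metrics": {
            "tags": [
                {"confidence": "High", "citation_status": "Created"},
                {"confidence": "Low", "citation_status": "Failed"},
            ]
        },
    },
}

results, citations, metrics = unpack_output(task)

# The three arrays are index-aligned; treat a null citation entry as an empty list.
tag_rows = [
    (tag, len(cites or []), metric["confidence"])
    for tag, cites, metric in zip(results["tags"], citations["tags"], metrics["tags"])
]
for row in tag_rows:
    print(row)
```

Coercing null to an empty list keeps downstream code free of special cases for elements that have no citations.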

Primitives and Final Items

In this documentation, a primitive is any JSON value that is one of: null, boolean, number, or string. We consider a field a final item when it holds a primitive value. For arrays of primitives, each element is treated as its own final item (e.g., tags[0], tags[1]). Citations and metrics are generated at the level of each final item.
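To make the definition concrete, here is a small sketch that walks a results object and yields every final item, labelling each with a dot-and-bracket path. The function name iter_final_items and the sample data are illustrative assumptions; the only rule taken from the text is that anything other than an object or array counts as a primitive.

```python
from typing import Any, Iterator

def iter_final_items(value: Any, path: str = "") -> Iterator[tuple[str, Any]]:
    """Yield (path, primitive) pairs for every final item under `value`."""
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            yield from iter_final_items(child, child_path)
    elif isinstance(value, list):
        # Each array element is visited on its own, so tags[0] and tags[1]
        # become separate final items.
        for index, child in enumerate(value):
            yield from iter_final_items(child, f"{path}[{index}]")
    else:
        # null, boolean, number, or string: a primitive, hence a final item.
        yield path, value

results = {
    "invoice_number": "INV-2024-001",
    "vendor": {"vendor_name": "Acme Corp"},
    "line_items": [{"item_description": "Widget A", "quantity": 10}],
    "tags": ["Overdue", "International"],
}

final_items = list(iter_final_items(results))
for item_path, item_value in final_items:
    print(item_path, "=", item_value)
```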

How Field Paths Work

Field paths use dot notation for nested objects and bracket notation for arrays:
  • Top-level fields: invoice_number, total_amount
  • Nested object fields: vendor.vendor_name, vendor.contact_email
  • Array item fields: line_items[0].item_description, line_items[1].unit_price
These paths can be used to address values within both the citations and metrics objects (which mirror results), allowing you to programmatically link each extracted value back to its source location and confidence level.
{
  "task_id": "extract-8b7e7e8a-...",
  "status": "Succeeded",
  "output": {
    "results": {
      "invoice_number": "INV-2024-001",
      "invoice_date": "2024-03-15",
      "due_date": "2024-04-15",
      "vendor": {
        "vendor_name": "Acme Corp",
        "vendor_id": "ACME-001",
        "contact_email": "[email protected]",
        "phone_number": "+1-555-123-4567",
        "address": "1 Acme Way, Metropolis, NY 10001"
      },
      "line_items": [
        { "item_description": "Widget A", "quantity": 10, "unit_price": 12.5, "line_total": 125.0 },
        { "item_description": "Widget B", "quantity": 4, "unit_price": 50.0, "line_total": 200.0 }
      ],
      "subtotal": 325.0,
      "tax_amount": 26.0,
      "total_amount": 351.0,
      "payment_terms": "Net 30",
      "tags": ["Overdue", "International"]
    },
    "citations": {
      "invoice_number": [
        {
          "citation_type": "Segment",
          "content": "Invoice # INV-2024-001",
          "segment_type": "Text",
          "page_number": 1,
          "page_width": 792,
          "page_height": 612,
          "bboxes": [ { "left": 450, "top": 120, "width": 100, "height": 20 } ]
        }
      ],
      "vendor": {
        "vendor_name": [
          {
            "citation_type": "Segment",
            "content": "Acme Corp",
            "segment_type": "Text",
            "page_number": 1,
            "page_width": 792,
            "page_height": 612,
            "bboxes": [ { "left": 100, "top": 200, "width": 200, "height": 18 } ]
          }
        ]
      },
      "line_items": [
        {
          "item_description": [
            {
              "citation_type": "Segment",
              "content": "Widget A",
              "segment_type": "Text",
              "page_number": 1,
              "page_width": 792,
              "page_height": 612,
              "bboxes": [ { "left": 100, "top": 350, "width": 80, "height": 15 } ]
            }
          ]
        },
        {
          "item_description": [
            {
              "citation_type": "Segment",
              "content": "Widget B",
              "segment_type": "Text",
              "page_number": 1,
              "page_width": 792,
              "page_height": 612,
              "bboxes": [ { "left": 100, "top": 370, "width": 80, "height": 15 } ]
            }
          ],
          "quantity": [
            {
              "citation_type": "Segment",
              "content": "4",
              "segment_type": "Text",
              "page_number": 1,
              "page_width": 792,
              "page_height": 612,
              "bboxes": [ { "left": 400, "top": 370, "width": 30, "height": 15 } ]
            }
          ]
        }
      ],
      "tags": [
        [
          {
            "citation_type": "Segment",
            "content": "Overdue",
            "segment_type": "Text",
            "page_number": 1,
            "page_width": 792,
            "page_height": 612,
            "bboxes": [ { "left": 120, "top": 160, "width": 80, "height": 16 } ]
          }
        ],
        [
          {
            "citation_type": "Segment",
            "content": "International",
            "segment_type": "Text",
            "page_number": 1,
            "page_width": 792,
            "page_height": 612,
            "bboxes": [ { "left": 210, "top": 160, "width": 120, "height": 16 } ]
          }
        ]
      ]
    },
    "metrics": {
      "invoice_number": { "confidence": "High", "citation_status": "Created" },
      "vendor": {
        "vendor_name": { "confidence": "High", "citation_status": "Created" },
        "contact_email": { "confidence": "Low", "citation_status": "Created" }
      },
      "line_items": [
        {
          "item_description": { "confidence": "High", "citation_status": "Created" },
          "quantity": { "confidence": "High", "citation_status": "Created" }
        },
        {
          "item_description": { "confidence": "High", "citation_status": "Created" },
          "quantity": { "confidence": "High", "citation_status": "Created" }
        }
      ],
      "tags": [
        { "confidence": "High", "citation_status": "Created" },
        { "confidence": "High", "citation_status": "Created" }
      ],
      "total_amount": { "confidence": "High", "citation_status": "Created" }
    }
  }
}
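
Because results, citations, and metrics share one shape, a single path resolver can serve all three. The sketch below is one possible way to follow a field path through a deserialized response; resolve_path is a hypothetical helper, not an API function, and it assumes keys contain no dots or brackets. It returns None when a step is missing, which can happen when a field has no entry in citations.

```python
import re
from typing import Any

# A path token is either an object key or a bracketed array index.
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")

def resolve_path(tree: Any, field_path: str) -> Any:
    """Follow a path such as 'line_items[1].quantity'; None if any step is missing."""
    node = tree
    for key, index in _TOKEN.findall(field_path):
        if key:
            if not isinstance(node, dict) or key not in node:
                return None
            node = node[key]
        else:
            position = int(index)
            if not isinstance(node, list) or position >= len(node):
                return None
            node = node[position]
    return node

# Trimmed copy of the sample output above (only the fields used here).
output = {
    "results": {"line_items": [{"quantity": 10}, {"quantity": 4}]},
    "citations": {
        "line_items": [
            {},
            {"quantity": [{"citation_type": "Segment", "content": "4", "page_number": 1}]},
        ]
    },
    "metrics": {
        "line_items": [
            {"quantity": {"confidence": "High", "citation_status": "Created"}},
            {"quantity": {"confidence": "High", "citation_status": "Created"}},
        ]
    },
}

path = "line_items[1].quantity"
value = resolve_path(output["results"], path)
sources = resolve_path(output["citations"], path) or []
metric = resolve_path(output["metrics"], path)
print(value, [c["content"] for c in sources], metric["confidence"])
```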

Key Output Fields

1. results

The results object contains your extracted data, structured exactly according to the JSON schema you provided. Every field, nested object, and array element follows your schema definition, making it simple to integrate into your application logic.

2. citations

Citations provide full traceability for each extracted value. The citations object mirrors your results structure. At each leaf (primitive value), you will find an array of citation objects. For arrays of primitives, citations is an array where each element holds the citation array for that index (e.g., tags[0], tags[1]) or null if no citations were created for that element. A single field can have multiple citation types depending on the document source and extraction granularity.
Citation Granularity:
  • Segment-level citations are always provided. These reference semantic elements like paragraphs, tables, or text blocks that support the extracted value.
  • Word-level citations may also be included for finer-grained traceability. When word-level citations are present, segment-level citations will also be included in the same array.
For spreadsheets, we also provide cell range and sheet name information in Segment and Word-level citations inside the ss_ranges and ss_sheet_name fields.
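Since one field's citation array can mix both granularities, it may help to group the entries before use. The following sketch groups a citation array by its citation_type value and then prefers word-level entries when they exist; split_by_granularity and the sample array are illustrative assumptions rather than part of the API.

```python
from collections import defaultdict
from typing import Any, Optional

def split_by_granularity(
    field_citations: Optional[list[dict[str, Any]]],
) -> dict[str, list[dict[str, Any]]]:
    """Group one field's citations by citation_type (hypothetical helper)."""
    groups: dict[str, list[dict[str, Any]]] = defaultdict(list)
    # A null entry (no citations created) is treated as an empty array.
    for citation in field_citations or []:
        groups[citation.get("citation_type", "Unknown")].append(citation)
    return dict(groups)

# Illustrative citation array for a single field.
field_citations = [
    {"citation_type": "Segment", "content": "Invoice # INV-2024-001", "page_number": 1},
    {"citation_type": "Word", "content": "INV-2024-001", "page_number": 1},
]

groups = split_by_granularity(field_citations)

# Prefer the finer word-level citations, falling back to segment-level ones.
finest = groups.get("Word") or groups.get("Segment", [])
print([c["content"] for c in finest])
```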
Citation Object Fields:
  • citation_id: Unique identifier for this citation.
  • citation_type: The citation granularity: "Segment" or "Word".
  • content: The content supporting the extraction. For Segment citations, this is the HTML/Markdown content from the Parse output (e.g., HTML for a table). For Word citations, it’s the raw OCR text.
  • segment_type: The type of segment (e.g., "Text", "Table", "Title"). Only present for segment-level citations.
  • segment_id: Identifier linking back to the original segment from Parse. Only present for segment-level citations.
  • page_number: The page where the citation appears.
  • page_width, page_height: Page dimensions in pixels for normalizing bounding boxes.
  • bboxes: Array of bounding box objects ({ left, top, width, height } in pixels) pinpointing the exact location(s) on the page.
  • ss_ranges: Array of cell ranges in A1 notation (e.g., ["A1:C10"]). Only present for spreadsheet citations.
  • ss_sheet_name: The sheet name where the data was found. Only present for spreadsheet citations.
{
  "citation_id": "abc1234",
  "citation_type": "Segment",
  "content": "Invoice # INV-2024-001",
  "segment_id": "seg_001", // Only present for segment citations
  "segment_type": "Text",
  "page_number": 1,
  "page_height": 612,
  "page_width": 792,
  "bboxes": [
    { "left": 450, "top": 120, "width": 100, "height": 20 }
  ],
  "ss_ranges": ["D15:E20"], // Only present if file is a spreadsheet
  "ss_sheet_name": "Invoice" // Only present if file is a spreadsheet
}
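A document viewer rarely renders pages at their original pixel size, so a common first step is to convert each bounding box into fractions of the page. The sketch below divides by page_width and page_height as described above, then scales to a hypothetical 1584 x 1224 viewer canvas; the helper names and the canvas size are assumptions for illustration.

```python
from typing import Any

def normalized_bboxes(citation: dict[str, Any]) -> list[dict[str, float]]:
    """Convert a citation's pixel bboxes to 0..1 fractions of the page."""
    page_w = citation["page_width"]
    page_h = citation["page_height"]
    return [
        {
            "left": box["left"] / page_w,
            "top": box["top"] / page_h,
            "width": box["width"] / page_w,
            "height": box["height"] / page_h,
        }
        for box in citation.get("bboxes", [])
    ]

def to_viewer_rect(box: dict[str, float], viewer_w: float, viewer_h: float) -> dict[str, float]:
    """Scale a normalized box to a viewer canvas of the given size."""
    return {
        "left": box["left"] * viewer_w,
        "top": box["top"] * viewer_h,
        "width": box["width"] * viewer_w,
        "height": box["height"] * viewer_h,
    }

# Values copied from the citation object above.
citation = {
    "page_width": 792,
    "page_height": 612,
    "bboxes": [{"left": 450, "top": 120, "width": 100, "height": 20}],
}

boxes = normalized_bboxes(citation)
# The canvas here is exactly twice the page size, so every pixel value doubles.
rect = to_viewer_rect(boxes[0], viewer_w=1584, viewer_h=1224)
print(rect)
```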
Citations enable powerful use cases:
  • Document Viewers: Highlight the exact source text when a user clicks on an extracted field.
  • Validation Workflows: Let human reviewers verify extracted values against their original context.
  • Audit Trails: Track which parts of a document contributed to each data point.
  • Spreadsheet Navigation: Jump directly to source cells in spreadsheet viewers using ss_ranges.

3. metrics

The metrics object mirrors your results structure. At each leaf field, it holds a metrics object with two fields:
  • confidence: High or Low, indicating if the value is supported by citations.
  • citation_status: Created, Failed, or Skipped, indicating the status of citation generation.
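One way to use these values is to route uncertain fields to a human reviewer. The sketch below walks a metrics object and lists the paths whose confidence is Low or whose citation_status is Failed. The review policy, the helper names, and the rule for spotting a metrics leaf (a dict whose confidence value is a string) are assumptions made for this example.

```python
from typing import Any, Iterator

def iter_metrics(node: Any, path: str = "") -> Iterator[tuple[str, dict[str, Any]]]:
    """Yield (field_path, metrics_object) for every leaf of a metrics tree."""
    if isinstance(node, dict):
        # Assumption: a leaf metrics object carries a string "confidence".
        if isinstance(node.get("confidence"), str):
            yield path, node
            return
        for key, child in node.items():
            yield from iter_metrics(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_metrics(child, f"{path}[{index}]")

def needs_review(metrics: dict[str, Any]) -> list[str]:
    """Return paths with Low confidence or Failed citation generation."""
    return [
        path
        for path, metric in iter_metrics(metrics)
        if metric.get("confidence") == "Low" or metric.get("citation_status") == "Failed"
    ]

# Subset of the metrics object from the sample response.
metrics = {
    "invoice_number": {"confidence": "High", "citation_status": "Created"},
    "vendor": {
        "vendor_name": {"confidence": "High", "citation_status": "Created"},
        "contact_email": {"confidence": "Low", "citation_status": "Created"},
    },
    "tags": [
        {"confidence": "High", "citation_status": "Created"},
        {"confidence": "High", "citation_status": "Created"},
    ],
}

flagged = needs_review(metrics)
print(flagged)
```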
Looking for a complete schema of all output fields? See our API Reference.