
High-Level Structure

Extract returns a Task object. When processing succeeds, the output field contains three objects: results, which holds your custom JSON schema populated with the extracted values, and citations and metrics, which tie each field in your schema back to its source location and a confidence level.
  • results: Your exact JSON schema structure, filled with extracted data.
  • citations: A mapping from each field path (e.g., vendor.vendor_name, line_items[0].item_description) to an array of citation objects. Each citation contains the source content, page_number, and precise bboxes[] that pinpoint where the value was found in the document.
  • metrics: An object that mirrors your schema structure. Each extracted field holds a metrics object containing a confidence value (High or Low) and a citation_status, helping you identify which extractions may need review.

How Field Path Mapping Works

Field paths use dot notation for nested objects and bracket notation for arrays:
  • Top-level fields: invoice_number, total_amount
  • Nested object fields: vendor.vendor_name, vendor.contact_email
  • Array item fields: line_items[0].item_description, line_items[1].unit_price
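These paths act as keys in the citations object. The metrics object in the example response is nested to mirror your schema, so the same path locates a field's metrics when you follow it key by key. Together, they let you programmatically link each extracted value back to its source location and confidence level.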
{
  "task_id": "extract-8b7e7e8a-...",
  "status": "Succeeded",
  "output": {
    "results": {
      "invoice_number": "INV-2024-001",
      "invoice_date": "2024-03-15",
      "due_date": "2024-04-15",
      "vendor": {
        "vendor_name": "Acme Corp",
        "vendor_id": "ACME-001",
        "contact_email": "[email protected]",
        "phone_number": "+1-555-123-4567",
        "address": "1 Acme Way, Metropolis, NY 10001"
      },
      "line_items": [
        { "item_description": "Widget A", "quantity": 10, "unit_price": 12.5, "line_total": 125.0 },
        { "item_description": "Widget B", "quantity": 4, "unit_price": 50.0, "line_total": 200.0 }
      ],
      "subtotal": 325.0,
      "tax_amount": 26.0,
      "total_amount": 351.0,
      "payment_terms": "Net 30"
    },
    "citations": {
      "invoice_number": [
        {
          "citation_type": "Segment",
          "content": "Invoice # INV-2024-001",
          "page_number": 1,
          "page_width": 792,
          "page_height": 612,
          "bboxes": [ { "left": 450, "top": 120, "width": 100, "height": 20 } ]
        }
      ],
      "vendor.vendor_name": [ 
        { 
          "citation_type": "Segment", 
          "content": "Acme Corp", 
          "page_number": 1, 
          "page_width": 792,
          "page_height": 612,
          "bboxes": [ { "left": 100, "top": 200, "width": 200, "height": 18 } ] 
        } 
      ],
      "line_items[0].item_description": [
        {
          "citation_type": "Segment",
          "content": "Widget A",
          "page_number": 1,
          "page_width": 792,
          "page_height": 612,
          "bboxes": [ { "left": 100, "top": 350, "width": 80, "height": 15 } ]
        }
      ]
      // ... other field-level citations
    },
    "metrics": {
      "invoice_number": { "confidence": "High", "citation_status": "Created" },
      "vendor": {
        "vendor_name": { "confidence": "High", "citation_status": "Created" },
        "contact_email": { "confidence": "Low", "citation_status": "Created" }
      },
      "total_amount": { "confidence": "High", "citation_status": "Created" }
    }
  }
}
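Given a response shaped like the example above, the field paths can be resolved against results to pair each extracted value with its citations. The following is a minimal sketch, not an official client: resolve_field_path and link_values_to_citations are hypothetical helper names, and it assumes paths only use the dot and [index] forms described above, with no dots or brackets inside key names.

```python
import re

# Matches either a plain key (e.g. "vendor") or a bracketed index (e.g. "[0]").
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def resolve_field_path(results, path):
    """Walk `results` along a path such as "line_items[0].item_description"."""
    value = results
    for key, index in _TOKEN.findall(path):
        # A bracketed token indexes into a list; a plain token is a dict key.
        value = value[int(index)] if index else value[key]
    return value


def link_values_to_citations(output):
    """Return {field_path: (extracted_value, citations)} for every cited path."""
    return {
        path: (resolve_field_path(output["results"], path), cites)
        for path, cites in output["citations"].items()
    }
```

Calling link_values_to_citations(task["output"]) on a parsed response would then give one entry per cited field path. A path that does not exist in results raises KeyError or IndexError, which may be worth catching in production code.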

Key Output Fields

1. results

The results object contains your extracted data, structured exactly according to the JSON schema you provided. Every field, nested object, and array element follows your schema definition, making it simple to integrate into your application logic.

2. citations

Citations provide full traceability for each extracted value. Each key in the citations object is a field path that maps to an array of citation objects. A single field can have multiple citation types depending on the document source and extraction granularity.
Citation Granularity:
  • Segment-level citations are always provided. These reference semantic elements like paragraphs, tables, or text blocks that support the extracted value.
  • Word-level citations may also be included for finer-grained traceability. When word-level citations are present, segment-level citations will also be included in the same array.
  • Range citations (spreadsheets only): when extracting from .xlsx or .xls files, Segment and Word-level citations also carry cell range information.
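Because one field's citation array can mix granularities, it can help to group the entries by citation_type before using them. This is a rough sketch under the assumption that citation_type is always "Segment" or "Word", as listed in this section; both function names are hypothetical.

```python
def citations_by_type(field_citations):
    """Group one field's citation array by its citation_type value."""
    grouped = {}
    for citation in field_citations:
        grouped.setdefault(citation["citation_type"], []).append(citation)
    return grouped


def finest_citations(field_citations):
    """Prefer Word citations when present, otherwise fall back to Segment ones."""
    grouped = citations_by_type(field_citations)
    return grouped.get("Word") or grouped.get("Segment", [])
```

A viewer might use finest_citations for tight highlights while still keeping the Segment entries for surrounding context.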
Citation Object Fields:
  • citation_id: Unique identifier for this citation.
  • citation_type: The citation granularity: "Segment" or "Word".
  • content: The content supporting the extraction. For Segment citations, this is the HTML/Markdown content from the Parse output (e.g., HTML for a table). For Word citations, it’s the raw OCR text.
  • segment_type: The type of segment (e.g., "Text", "Table", "Title"). Only present for segment-level citations.
  • segment_id: Identifier linking back to the original segment from Parse. Only present for segment-level citations.
  • page_number: The page where the citation appears.
  • page_width, page_height: Page dimensions in pixels for normalizing bounding boxes.
  • bboxes: Array of bounding box objects ({ left, top, width, height } in pixels) pinpointing the exact location(s) on the page.
  • ss_ranges: Array of cell ranges in A1 notation (e.g., ["A1:C10"]). Only present for spreadsheet citations.
  • ss_sheet_name: The sheet name where the data was found. Only present for spreadsheet citations.
{
  "citation_id": "abc1234",
  "citation_type": "Segment",
  "content": "Invoice # INV-2024-001",
  "segment_id": "seg_001",
  "segment_type": "Text",
  "page_number": 1,
  "page_height": 612,
  "page_width": 792,
  "bboxes": [
    { "left": 450, "top": 120, "width": 100, "height": 20 }
  ]
}
Citations enable powerful use cases:
  • Document Viewers: Highlight the exact source text when a user clicks on an extracted field.
  • Validation Workflows: Let human reviewers verify extracted values against their original context.
  • Audit Trails: Track which parts of a document contributed to each data point.
  • Spreadsheet Navigation: Jump directly to source cells in spreadsheet viewers using ss_ranges.
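For the document viewer case, a citation's pixel bounding boxes can be converted to fractions of the page so that highlights scale with whatever size the page is rendered at. The sketch below assumes that bboxes are expressed in the same pixel space as page_width and page_height, as the field list describes; normalized_boxes is a hypothetical helper, and citations without bboxes (for example, spreadsheet citations that only carry ss_ranges) are assumed to yield no boxes.

```python
def normalized_boxes(citation):
    """Convert a citation's pixel bboxes to fractions of the page size."""
    boxes = citation.get("bboxes") or []
    if not boxes:
        # Nothing to highlight, e.g. a citation that only has ss_ranges.
        return []
    width, height = citation["page_width"], citation["page_height"]
    return [
        {
            "page_number": citation["page_number"],
            "left": box["left"] / width,
            "top": box["top"] / height,
            "width": box["width"] / width,
            "height": box["height"] / height,
        }
        for box in boxes
    ]
```

Multiplying each fraction by the rendered page's width or height then gives the overlay rectangle in screen coordinates.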

3. metrics

The metrics object mirrors your schema structure and provides metrics for each extracted field. It contains:
  • confidence: High or Low, indicating if the value is supported by citations.
  • citation_status: Created, Failed, or Skipped, indicating the status of citation generation.
Confidence values are currently experimental. Treat them as a heuristic for ranking rather than a definitive measure of correctness. We recommend using them conservatively and pairing them with a review of the citations.
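One conservative way to use these values is to route any field that is not both High confidence and successfully cited to a human reviewer. The sketch below walks a nested metrics object like the one in the example response. It rests on assumptions: a leaf is recognized by having a string confidence value, array fields are assumed to appear as lists, and fields_needing_review is a hypothetical name.

```python
def fields_needing_review(metrics, prefix=""):
    """Yield (field_path, entry) for leaves that are not High and Created."""
    if isinstance(metrics, list):
        # Assumed shape for array fields: one metrics entry per element.
        for i, item in enumerate(metrics):
            yield from fields_needing_review(item, f"{prefix}[{i}]")
    elif isinstance(metrics, dict):
        if isinstance(metrics.get("confidence"), str):
            # Leaf metrics object for a single extracted field.
            if (metrics["confidence"] != "High"
                    or metrics.get("citation_status") != "Created"):
                yield prefix, metrics
        else:
            # Nested object: extend the field path and keep walking.
            for key, child in metrics.items():
                path = f"{prefix}.{key}" if prefix else key
                yield from fields_needing_review(child, path)
```

The yielded paths use the same dot and bracket notation as the citations keys, so each flagged field can be looked up in citations for review.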
Looking for a complete schema of all output fields? See our API Reference.