High-Level Structure

Parse returns a Task object. When processing is successful, the output field contains the HTML/Markdown representation of your document. The core of this output is a list of chunks, which are composed of individual segments.
{
  "task_id": "8b7e7e8a-...",
  "status": "Succeeded",
  "output": {
    "file_name": "document.pdf",
    "page_count": 2,
    "chunks": [
      // ... array of chunk objects
    ],
    "pages": [
      // ... array of page objects with full-page images
    ]
  },
  // ... other task metadata
}
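
The nesting above (Task, then output, then chunks, then segments) can be walked with a few lines of Python. This is a minimal sketch that assumes the Task has already been loaded into a plain dictionary; the helper name `iter_segments` and the sample payload are hypothetical and only mirror the fields shown in the example.

```python
from typing import Any, Iterator


def iter_segments(task: dict[str, Any]) -> Iterator[tuple[int, dict[str, Any]]]:
    """Yield (chunk_index, segment) pairs from a successful Task payload."""
    # Only a succeeded task is expected to carry a usable output.
    if task.get("status") != "Succeeded":
        return
    output = task.get("output") or {}
    for chunk_index, chunk in enumerate(output.get("chunks", [])):
        for segment in chunk.get("segments", []):
            yield chunk_index, segment


# Hypothetical payload shaped like the example above (values are made up).
sample_task = {
    "task_id": "example-task",
    "status": "Succeeded",
    "output": {
        "file_name": "document.pdf",
        "page_count": 2,
        "chunks": [
            {"segments": [{"segment_type": "Title", "content": "# Report"}]},
            {
                "segments": [
                    {"segment_type": "Text", "content": "Body text."},
                    {"segment_type": "Table", "content": "<table></table>"},
                ]
            },
        ],
    },
}

pairs = list(iter_segments(sample_task))
for chunk_index, segment in pairs:
    print(chunk_index, segment["segment_type"])
```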

Chunks and Segments

The document is first broken down into segments, which represent individual semantic elements like a paragraph, table, or title. These segments are then grouped into chunks based on your chunking configuration.
  • Segments: The smallest building blocks. Each segment corresponds to a single, identified element from the source document.
  • Chunks: A logical grouping of one or more segments. Each chunk includes chunk_length, content, embed, and segments[]. For RAG applications, chunks are the units of information that are typically embedded and retrieved.
The supported segment types are: Caption, Footnote, Formula, FormRegion, GraphicalItem, Legend, LineNumber, ListItem, Page, PageFooter, PageHeader, PageNumber, Picture, Table, Text, Title, Unknown, and SectionHeader.
By default, no segment types are ignored. To change this, adjust your configuration to ignore specific segment types, such as page headers and footers.
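
If you would rather filter after parsing than change the configuration, the same effect can be approximated client-side. The sketch below is a post-processing step, not the configuration option itself; the function name `drop_segments` and the choice of ignored types are assumptions for illustration.

```python
from typing import Any

# Segment types treated as page furniture in this sketch (an assumption).
IGNORED_TYPES = frozenset({"PageHeader", "PageFooter", "PageNumber"})


def drop_segments(
    chunks: list[dict[str, Any]],
    ignored: frozenset[str] = IGNORED_TYPES,
) -> list[dict[str, Any]]:
    """Return copies of the chunks without segments of the ignored types."""
    filtered = []
    for chunk in chunks:
        kept = [
            segment
            for segment in chunk.get("segments", [])
            if segment.get("segment_type") not in ignored
        ]
        # Skip chunks that contained nothing but ignored segments.
        if kept:
            filtered.append({**chunk, "segments": kept})
    return filtered


# Hypothetical input.
chunks = [
    {"segments": [{"segment_type": "PageHeader"}, {"segment_type": "Text"}]},
    {"segments": [{"segment_type": "PageFooter"}]},
]
cleaned = drop_segments(chunks)
```

Note that this only prunes the segments list. Chunk-level fields such as content and embed were built before filtering, so they still include the removed segments.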

Key Output Fields

Each segment object contains rich information. At the chunk level, corresponding fields are concatenated from all of the segments within that chunk. Here are the most important fields:

1. content

The content field holds the primary, structured representation of the segment. Each segment is formatted based on its type.
  • Tables: Converted to HTML to maintain complex col/row-span structure.
  • Images: Converted to a robust markdown description, with charts/graphs including a tabular representation.
  • Forms and Legends: Converted into a key-value markdown table.
  • Formulas: Converted to LaTeX strings so the mathematical notation is preserved. If a formula appears inside a table cell, the LaTeX is embedded within the table's HTML.
  • Text-type: Text-heavy segments such as titles, section headers, list items, and text blocks are converted into markdown.
{
  "segment_type": "Table",
  "content": "<table><tr><td>The formula is:</td><td>\\( E=mc^2 \\)</td></tr></table>"
}
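
Because the format of content depends on the segment type, a consumer usually needs to know which renderer to apply. The following sketch maps a segment type to the format described in the list above; the function name `content_format` is hypothetical, and treating every type not mentioned in the list as markdown is an assumption.

```python
# Formats follow the list above: tables are HTML, formulas are LaTeX,
# and the remaining listed types are markdown.
HTML_TYPES = frozenset({"Table"})
LATEX_TYPES = frozenset({"Formula"})


def content_format(segment_type: str) -> str:
    """Return 'html', 'latex', or 'markdown' for a segment type."""
    if segment_type in HTML_TYPES:
        return "html"
    if segment_type in LATEX_TYPES:
        return "latex"
    # Assumption: anything else is treated as markdown.
    return "markdown"


segment = {
    "segment_type": "Table",
    "content": "<table><tr><td>The formula is:</td><td>\\( E=mc^2 \\)</td></tr></table>",
}
fmt = content_format(segment["segment_type"])
```

An HTML table may itself contain LaTeX, as in the example above, so an HTML renderer may also need a math typesetting step.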

2. embed

The embed field provides the clean, RAG-optimized text that should be used for generating embeddings.
  • It includes the content and, if present, the description. This gives segments such as tables extra context for retrieval without contaminating the content field.
  • This is the field used for calculating token counts when chunking, ensuring chunks fit your target length.
{
  "chunk_id": "chunk-1-...",
  "chunk_length": 45,
  "embed": "The table shows a 15% increase in Q2 revenue for Widget A... | Product | Q1 | Q2 | ...",
  "segments": [ /* ... */ ]
}
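
A typical RAG pipeline turns each chunk into one record for the embedding step. The sketch below collects the embed text per chunk; the function name `embedding_records` and the record keys are hypothetical, and the call to an embedding model is left out because it depends on your provider.

```python
from typing import Any


def embedding_records(chunks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Build one record per chunk from its embed text, skipping empty ones."""
    records = []
    for chunk in chunks:
        text = (chunk.get("embed") or "").strip()
        if not text:
            continue
        records.append(
            {
                "id": chunk.get("chunk_id"),
                "text": text,
                # chunk_length is passed through as reported in the output.
                "length": chunk.get("chunk_length"),
            }
        )
    return records


# Hypothetical chunks; the ids and text are made up.
chunks = [
    {"chunk_id": "chunk-a", "chunk_length": 45, "embed": "Q2 revenue rose."},
    {"chunk_id": "chunk-b", "chunk_length": 0, "embed": "   "},
]
records = embedding_records(chunks)
```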

3. bbox (Bounding Box)

Every segment includes a precise bounding box (bbox) that pinpoints its exact location on the original page. This is essential for building applications that require citations or highlighting.
  • The coordinates (left, top, width, height) are pixel (px) values in the page coordinate space. For resolution-independent rendering, normalize them using the page dimensions: for example, left_pct = left / page_width and top_pct = top / page_height, with width and height normalized the same way. Use page_width/page_height on the segment (or the page’s pg_width/pg_height).
  • The dpi in the pages array describes the pixel resolution of the rendered page image. You do not need dpi when using normalized percentages; it is helpful only when mapping directly to a specific raster image in pixels or when generating images at a different scale.
{
  "segment_type": "Text",
  "bbox": { "left": 100, "top": 250, "width": 500, "height": 50 },
  "page_number": 1,
  "page_height": 1584,
  "page_width": 1224,
  ...
}
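
The normalization described above can be sketched as follows, using the values from the example segment. The function names `normalize_bbox` and `to_pixels` are hypothetical helpers, and the viewer size used at the end is an arbitrary illustration.

```python
from typing import Any


def normalize_bbox(segment: dict[str, Any]) -> dict[str, float]:
    """Convert a segment's pixel bbox into fractions of the page size."""
    bbox = segment["bbox"]
    page_width = segment["page_width"]
    page_height = segment["page_height"]
    return {
        "left_pct": bbox["left"] / page_width,
        "top_pct": bbox["top"] / page_height,
        "width_pct": bbox["width"] / page_width,
        "height_pct": bbox["height"] / page_height,
    }


def to_pixels(norm: dict[str, float], width: float, height: float) -> dict[str, float]:
    """Scale a normalized bbox onto a rendering of the given pixel size."""
    return {
        "left": norm["left_pct"] * width,
        "top": norm["top_pct"] * height,
        "width": norm["width_pct"] * width,
        "height": norm["height_pct"] * height,
    }


segment = {
    "segment_type": "Text",
    "bbox": {"left": 100, "top": 250, "width": 500, "height": 50},
    "page_number": 1,
    "page_height": 1584,
    "page_width": 1224,
}

norm = normalize_bbox(segment)
# Draw the same box on a viewer that renders the page at half size.
half_size = to_pixels(norm, 612, 792)
```

Working in fractions means the highlight stays aligned however the page image is scaled, which is why dpi is not needed on this path.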

Spreadsheet-Specific Outputs (ss_*) Preview

This feature is in preview. Occasionally, extremely large spreadsheets can fail. In that case, we still return HTML, but layout analysis is not performed.

[Figure: A financial spreadsheet showing intelligent segmentation of tables and charts.]

When processing spreadsheet files (.xlsx, .xls), the output includes additional ss_* prefixed fields that provide native Excel context. These fields exist alongside the standard content, embed, and bbox fields, enriching each segment with its precise location and native data from the original spreadsheet.

Key Fields

  • ss_range: The cell range for the segment in A1 notation (e.g., A1:D10).
  • ss_cells: A detailed array of each cell in the segment, including its original formula, value, text, and styling. This lets you see both the raw formula (=SUM(B2:B10)) and its calculated result ($55,000).
  • ss_header_*: Fields identifying the detected header for a table, such as ss_header_range, ss_header_text, ss_header_bbox, and ss_header_ocr. Headers are intelligently associated even if they are not directly adjacent to the table.
These spreadsheet-native values unlock powerful capabilities:
  • Create Interactive Experiences: Use ss_range to build native citation experiences that let users click data and jump to the precise source cell in a viewer.
  • Get Cleaner LLM Context: Combine layout analysis with precise cell data to identify tables, associate headers, and filter out irrelevant cells. This provides cleaner, more meaningful context for LLM processing.
  • Build Powerful Spreadsheet Agents: Use the ss_* fields to build AI agents that can read, analyze, and even write back to spreadsheets. Understanding cell formulas and values enables agents to automate tasks like updating financial models, correcting entries, or adding new rows.
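
To jump to a source cell in a viewer, the A1-notation string in ss_range usually has to be turned into numeric row and column indices. The sketch below is one way to do that; the function names are hypothetical, and it assumes the range carries no sheet-name prefix (for example `Sheet1!A1:D10`), which the description above does not mention.

```python
import re

# One cell reference such as "A1" or "$B$12" (absolute markers are tolerated).
_CELL = re.compile(r"^\$?([A-Za-z]+)\$?(\d+)$")


def column_index(letters: str) -> int:
    """Convert column letters to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    index = 0
    for char in letters.upper():
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index


def parse_cell(ref: str) -> tuple[int, int]:
    """Return (row, column) as 1-based integers for a single cell reference."""
    match = _CELL.match(ref.strip())
    if match is None:
        raise ValueError(f"not an A1 cell reference: {ref!r}")
    return int(match.group(2)), column_index(match.group(1))


def parse_range(a1_range: str) -> tuple[tuple[int, int], tuple[int, int]]:
    """Return ((first_row, first_col), (last_row, last_col)) for an A1 range."""
    start, _, end = a1_range.partition(":")
    first = parse_cell(start)
    # A single cell such as "B2" is treated as a one-cell range.
    last = parse_cell(end) if end else first
    return first, last


first, last = parse_range("A1:D10")
rows = last[0] - first[0] + 1
cols = last[1] - first[1] + 1
```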

Looking for additional output fields? See the Advanced Outputs section below for more metadata options.

Advanced Outputs

Beyond the key fields discussed above, the Parse output is enriched with a variety of other useful metadata at the file, page, and segment levels. Here are some of the most valuable advanced fields:
  • Word-Level Bounding Boxes: Included in the ocr array for each page, this provides the precise coordinates for every single word detected by the OCR process. This is ideal for building applications that require highlighting specific words or phrases in a document viewer.
  • Cropped Segment Images: Each segment object contains an image field with a URL to a cropped image of just that segment. This is incredibly useful for providing visual context to an LLM or displaying the source of a specific chunk of text.
  • File & Page Metadata: The top-level output object contains file-level metadata like the original file_name, mime_type, and page_count. Additionally, the pages array contains detailed information for each page, including a full-page image URL, dimensions (page_width, page_height), and DPI.
For a comprehensive breakdown of every field available in the output, please refer to our API Reference.